MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

OpenMMLab
Apr 8, 2021 · 6 min read


The great sinologist Zhou Youguang, also known as the ‘father of Pinyin’, once said: “Language distinguishes human beings from animals, writing distinguishes civilization from barbarism, and education distinguishes advancement from regression.” [1]. The invention of writing is arguably the most significant advancement of humankind. Not only has it fundamentally transformed the way human knowledge is encoded and disseminated, but it has also allowed semantic information to be expressed concisely and accurately. Text is omnipresent, from road signs and street posters to product packaging. As a result, detecting, recognizing and deciphering textual information in images is of paramount importance in many real-life settings.

Recently, we open-sourced MMOCR, a new member of the OpenMMLab family: a toolbox that covers more than ten of the most common models for text detection, text recognition, and downstream tasks such as key information extraction (KIE). In particular, we have organized the models for text detection, text recognition and KIE into a single, standardized framework, since these tasks are often used in conjunction with one another in practical applications.

Please check out MMOCR at: https://github.com/open-mmlab/mmocr

MMOCR has the following unique features:

  • Comprehensive Pipeline: The toolbox supports not only text detection and text recognition, but also their downstream tasks such as key information extraction.
  • Multiple Models: The toolbox supports more than ten excellent models for text detection, recognition and understanding, covering single-stage and two-stage text detection, regular and irregular text recognition, as well as key information extraction based on graph convolutional networks.
  • Modular Design: MMOCR adopts a consistent framework and modular design. This not only makes existing code reusable, but also enables users to construct their own customized models. In particular, we have abstracted the backbone, neck, head and loss modules from the network architectures of the text detection, segmentation-based recognition and key information extraction models, and likewise the backbone, encoder, decoder and loss modules from the seq2seq-based text recognition models (see the config sketch after this list).
  • Fair Comparison: To date, image text recognition methods have differed widely in their training utilities, such as training datasets, pre-trained models, data augmentation methods, backbones, optimizers and learning rate schedules. For example, different text detection approaches often use different combinations of random resizing, rotation and cropping for data augmentation. This wide variety of settings makes it difficult to investigate which components are actually effective. The consistent framework and modular design of MMOCR let users combine modules freely, enabling fair comparisons between models that are hard to achieve with conventional codebases.
  • Easy to Use: We have standardized the labelling formats of some of the most common academic datasets and provide the corresponding annotation files, ready to be used in MMOCR. Together with the numerous pre-trained models, comprehensive documentation and benchmarks, MMOCR can be picked up quickly by any user.
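
To make the modular design concrete, below is a minimal, hypothetical config sketch in the usual OpenMMLab style. The field names and values are illustrative only and may not match the configs shipped with MMOCR; see the configs/ directory in the repository for the real files.

```python
# Hypothetical, abbreviated model config in the OpenMMLab style.
# Field names and values are illustrative; refer to MMOCR's configs/
# directory for the actual, complete configuration files.
model = dict(
    type='DBNet',                                   # detector assembled from modules
    backbone=dict(type='ResNet', depth=18),         # feature extractor
    neck=dict(type='FPNC', in_channels=[64, 128, 256, 512]),  # feature fusion
    bbox_head=dict(
        type='DBHead',                              # prediction head
        loss=dict(type='DBLoss')),                  # loss module
)
```

Because each component is registered and built independently, swapping in, say, a deeper backbone is a one-line change to the config rather than a code change.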

In addition, we would like to emphasize that MMOCR is not exclusively a research toolbox; it is also well suited to educational and industrial use. In the future, we will continue to equip MMOCR with more algorithms and models trained on different languages, and we value any kind of contribution along the way.

For now, let’s walk through each part of MMOCR and understand it in more detail.

Text Detection

The first step in image text recognition is text detection. In general, text detection can be categorized into multi-oriented and arbitrary-shaped text detection, both of which are covered by MMOCR. MMOCR reimplements some of the most recent and representative text detection models, listed below (see the inference sketch after the lists):

Two-Stage Algorithms

  • Mask R-CNN [2]

Single-Stage Algorithms

  • PANet [3]
  • PSENet [4]
  • DBNet [5]
  • TextSnake [6]
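
As a quick illustration of how one of these detectors can be run, here is a sketch using MMOCR's Python inference helpers. The config and checkpoint paths are placeholders (substitute real ones from the model zoo), and import locations may vary slightly across MMOCR versions, so treat this as a sketch rather than version-exact code.

```python
# Sketch of single-image text detection; paths are placeholders.
# Depending on the MMOCR version, init_detector may instead need to
# be imported from mmdet.apis.
from mmocr.apis import init_detector, model_inference

config_file = 'configs/textdet/dbnet/dbnet_r18_fpnc_1200e_icdar2015.py'
checkpoint_file = 'dbnet_r18.pth'  # hypothetical local checkpoint

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = model_inference(model, 'demo/demo_text_det.jpg')
print(result)  # detected text-region boundaries with confidence scores
```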

Text Recognition

The next step is text recognition, in which the textual content of a two-dimensional image is converted into a one-dimensional character string. Not only has MMOCR reimplemented the classical CRNN model [7], but it also covers recent seq2seq models, including SAR (an encoder-decoder model with a 2D attention mechanism) [8], the position-enhanced RobustScanner [9] and the Transformer-based NRTR [10], as well as segmentation-based text recognition methods.
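
Running a recognizer follows the same pattern as detection. The sketch below, again with placeholder paths and subject to the same version caveats, feeds a cropped text image to a SAR model:

```python
# Sketch of text recognition on a cropped text image; paths are placeholders.
from mmocr.apis import init_detector, model_inference

config_file = 'configs/textrecog/sar/sar_r31_parallel_decoder_academic.py'
checkpoint_file = 'sar_r31.pth'  # hypothetical local checkpoint

model = init_detector(config_file, checkpoint_file, device='cuda:0')
# The input is assumed to be an image cropped to a single text instance.
result = model_inference(model, 'demo/demo_text_recog.jpg')
print(result)  # decoded string with a confidence score
```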

The CRNN model, which is built on the CTC loss and primarily designed for regular text recognition, has been widely used in industry thanks to its efficiency. On the other hand, models based on Transformers, segmentation or attention mechanisms (e.g. SAR and RobustScanner) distinguish themselves with strong performance on irregular text.
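
A key reason for CRNN's practicality is the CTC loss itself: the network emits one character distribution per horizontal feature column, and CTC marginalizes over all monotonic alignments, so training needs only the label strings, not per-character positions. Here is a minimal PyTorch illustration of that loss (our own sketch, not MMOCR code):

```python
import torch
import torch.nn as nn

# Toy setup: T feature columns, N images, C classes (index 0 = CTC blank).
T, N, C = 32, 4, 37
# Stand-in for CRNN outputs: per-column log-probabilities over characters.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Unsegmented supervision: only label strings, no character positions.
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC sums over all valid alignments between columns and target characters.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```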

Key Information Extraction (KIE)

Key information extraction (KIE) is one of the most important and common downstream tasks in image text recognition. While text recognition outputs the text corresponding to each predicted bounding box, most practical applications require these outputs to be organized in a structured, semantically meaningful way. For instance, when applying text recognition to the image of a receipt, it is useful to extract the time and place of purchase, the products purchased and the total amount payable. Conventional KIE methods based on template matching are brittle and require a hand-crafted template for every document layout. In light of this, MMOCR supports the recently proposed Spatial Dual-Modality Graph Reasoning (SDMG-R) model [11]. SDMG-R builds a dual-modality graph over the detected text regions of a document image, combining the spatial relations between neighboring regions with their visual and textual features, and performs end-to-end KIE by reasoning over this graph with a deep neural network.
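
To give a flavor of the dual-modality idea, here is a small conceptual sketch of ours (not MMOCR's actual implementation): it derives pairwise spatial edge features from box geometry and gates per-box visual and textual features into node embeddings, which a graph reasoning module would then propagate over.

```python
import torch

def spatial_edge_features(boxes):
    """Pairwise spatial relations between text boxes.

    boxes: (N, 4) tensor of (x1, y1, x2, y2). Returns (N, N, 4) features:
    center offsets normalized by box height plus log size ratios, in the
    spirit of (but not identical to) SDMG-R's spatial encoding.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1.0)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1.0)
    dx = (cx[None, :] - cx[:, None]) / h[:, None]  # offsets scaled by box height
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    rw = torch.log(w[None, :] / w[:, None])        # relative width
    rh = torch.log(h[None, :] / h[:, None])        # relative height
    return torch.stack([dx, dy, rw, rh], dim=-1)

def fuse_modalities(visual, textual):
    """Toy gated fusion of per-box visual and textual features; the real
    model learns this fusion rather than using a fixed rule."""
    gate = torch.sigmoid(visual + textual)
    return gate * visual + (1.0 - gate) * textual

boxes = torch.tensor([[10., 20., 110., 50.], [10., 60., 200., 90.]])
edges = spatial_edge_features(boxes)  # (2, 2, 4) pairwise edge features
```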

Conclusions

As part of the OpenMMLab project, we will continue to improve MMOCR by fixing issues and adding new features, and we highly appreciate any kind of contribution along the way. We will also release more exciting OpenMMLab projects in the future, and we sincerely hope you will follow us along this journey! Below is a brief summary of the members of OpenMMLab.

  • MMCV: OpenMMLab foundational library for computer vision.
  • MMClassification: OpenMMLab image classification toolbox and benchmark.
  • MMDetection: OpenMMLab detection toolbox and benchmark.
  • MMDetection3D: OpenMMLab’s next-generation platform for general 3D object detection.
  • MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark.
  • MMAction2: OpenMMLab’s next-generation action understanding toolbox and benchmark.
  • MMTracking: OpenMMLab video perception toolbox and benchmark.
  • MMPose: OpenMMLab pose estimation toolbox and benchmark.
  • MMEditing: OpenMMLab image and video editing toolbox.

References

[1] http://edu.people.com.cn/n1/2019/0114/c1006-30526041.html

[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross B. Girshick: Mask R-CNN. ICCV 2017: 2980–2988

[3] Wenhai Wang, Enze Xie, Xiaoge Song, Yuhang Zang, Wenjia Wang, Tong Lu, Gang Yu, Chunhua Shen: Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network. ICCV 2019: 8439–8448

[4] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao: Shape Robust Text Detection With Progressive Scale Expansion Network. CVPR 2019: 9336–9345

[5] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, Xiang Bai: Real-Time Scene Text Detection with Differentiable Binarization. AAAI 2020: 11474–11481

[6] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, Cong Yao: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. ECCV 2018: 19–35

[7] Baoguang Shi, Xiang Bai, Cong Yao: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. In TPAMI, volume 39, pages 2298–2304, 2017

[8] Hui Li, Peng Wang, Chunhua Shen, Guyu Zhang: Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. AAAI 2019: 8610–8617.

[9] Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, Wayne Zhang: RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition. ECCV 2020: 135–151

[10] Fenfen Sheng, Zhineng Chen, Bo Xu: NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition. ICDAR 2019: 781–786

[11] Hongbin Sun, Zhanghui Kuang, Xiaoyu Yue, Chenhao Lin, Wayne Zhang: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv preprint arXiv:2103.14470, 2021
