When theory meets practice: A computer vision case
Hi everybody ☺! My name is Pavle Milosevic, and I work at the University of Belgrade, Faculty of Organizational Sciences (FON), as an Assistant Professor, and at Blinking as a Biometric Expert. At FON, I teach Systems Theory as well as some Computational Intelligence methods, e.g. neural networks and fuzzy logic. My research interests are closely related to these fields. At Blinking, I work in the Research Department, which deals with biometric technologies, computer vision techniques, machine and deep learning models, etc. One of my main responsibilities is to use the latest theoretical and practical findings to apply, adapt and create state-of-the-art approaches that fit our needs.
Although university is considered the best preparation for a job in industry, the tasks and problems at these two stages differ greatly. Problems at the undergraduate level are often standardized and limited to the use of pre-defined structures. Thus, thinking outside the box and observing the problem from different angles are not needed to solve them in a satisfactory manner. Also, students are instructed to study some traditional, even obsolete, models alongside novel algorithms. In other words, students get a well-formed basis for solving practical problems during their education, but with certain practical limitations.
In real-life machine learning tasks, it is common that a collected dataset or a pre-trained model does not meet a company’s (e.g. Blinking’s) requirements. This is the first and one of the most difficult problems that people from academia and/or young graduates encounter. For instance, the majority of facial image datasets are obtained either in laboratory conditions or “in the wild”. The former are incomplete from Blinking’s perspective, since they do not contain all scenarios relevant to the face verification use case, e.g. images taken in daylight and in the dark are omitted. “In the wild” datasets usually contain some images that are irrelevant for our use case, e.g. face images with very heavy motion blur, faces in extreme poses, low-resolution images, etc.
When light meets Computer Vision
In this post, I will pay (some) attention to practical problems that strong or partial illumination and occlusion cause in face verification and optical character recognition (OCR). Before explaining the observed problem, I have to provide brief definitions of the algorithms involved. A face verification algorithm compares a candidate face to another face from the database and verifies whether it is a match or not. An OCR algorithm converts images of text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, etc.
OK, we can move on to the problems. In practice, there is an evident degradation of face verification performance under extreme illumination conditions. Some examples of grayscale facial images taken from the literature are given in Fig. 1. As can be seen, some faces are hard to distinguish/recognize even for the human eye.
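To make the definition of face verification a bit more concrete, here is a minimal sketch of the final comparison step. It assumes that each face has already been turned into a fixed-length feature vector (an embedding) by some feature extractor, and that a match is declared when the cosine similarity exceeds a threshold. The function names, the toy vectors and the threshold value of 0.6 are all illustrative assumptions, not the values used in any real system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def verify(candidate, reference, threshold=0.6):
    """Return True when the two embeddings are similar enough to be a match.

    The threshold is a hypothetical value; in practice it would be tuned
    on validation data to balance false accepts and false rejects.
    """
    return cosine_similarity(candidate, reference) >= threshold


# Toy 4-dimensional embeddings, made up for illustration only.
reference = [0.12, 0.80, 0.35, 0.44]
same_person = [0.10, 0.78, 0.40, 0.41]
other_person = [0.90, 0.05, 0.10, 0.02]

print(verify(same_person, reference))   # similar vectors
print(verify(other_person, reference))  # dissimilar vectors
```

Poor illumination hurts exactly here: it shifts the embedding of the candidate image, so the similarity to the reference drops even when both images show the same person.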
Figure 1. Facial images with various illumination (Tan & Triggs, 2010)
The same holds for the OCR problem, i.e. when a dark shadow covers a text field, it is challenging for an OCR algorithm to properly detect and read all characters. Blur or low image quality may also be quite a challenge.
Figure 2. Yes, some algorithms can read this easily
When I define a research task for students, I often instruct them to ‘clean’ the dataset by removing the problematic instances. In these cases the big picture is more important than the specific issue. However, the problematic instances are the focus of our research activities at Blinking, since the others are ‘easy’ to solve. In other words, we have to do something to make them ‘easy’ to solve, too.
Image preprocessing is an extremely important phase in both face verification and OCR tasks. Preprocessing methods include various computer vision techniques, such as image resizing, rotation, format conversion, normalization, thresholding, etc. Image normalization and thresholding are the subject of this post, since they are used to eliminate the effects of different lighting conditions and to emphasize the most important elements of the image (letters/numbers in an OCR task, or eyes and other facial features in face recognition).
The primary task of image normalization algorithms is to eliminate the effects of varying illumination, as well as shadow, glare, etc. Some of the best-known algorithms used for this purpose are Gross-Brajovic preprocessing, Logarithmic Total Variation, and Tan-Triggs preprocessing. These procedures often include filtering, equalization and masking techniques (e.g. Gaussian blur, difference of Gaussians, histogram equalization, gamma correction, contrast equalization, facial hair masking, etc.). Most of these techniques are, to a greater or lesser extent, taught at universities.
On the other hand, thresholding is an image segmentation technique that tries to separate the foreground from the background of an image. Thresholding is performed on a single-channel image, usually a grayscale one, by comparing each pixel value with a threshold. In the literature, there are numerous techniques for threshold estimation, e.g. Otsu thresholding, Huang thresholding, etc. We may also distinguish global thresholding methods, which are applied to the whole image, from local/adaptive methods, which estimate different thresholds for different parts of an image. In any case, students often choose their thresholds by iterative search in order to maximize reading performance.
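Two of the building blocks listed above, gamma correction and histogram equalization, are simple enough to sketch in a few lines. The code below is only an illustration on a tiny grayscale image stored as a list of rows with 8-bit values; it is not the full Gross-Brajovic or Tan-Triggs chain, and the function names are my own. In a real project one would normally rely on an image processing library instead of plain Python lists.

```python
def gamma_correct(image, gamma=0.5):
    """Apply out = 255 * (in / 255) ** gamma to every pixel.

    A gamma below 1 brightens dark regions; a gamma above 1 darkens them.
    """
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


def equalize_histogram(image, levels=256):
    """Spread the gray levels that occur in the image over the full range."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Every pixel has the same value, so there is nothing to stretch.
        return [row[:] for row in image]

    # Lookup table that maps each old gray level to its equalized value.
    lut = [max(0, round((c - cdf_min) / (total - cdf_min) * (levels - 1)))
           for c in cdf]
    return [[lut[p] for p in row] for row in image]


# A dark, low-contrast 2x2 image (values made up for illustration).
dark = [[10, 20],
        [30, 40]]

brightened = gamma_correct(dark, gamma=0.5)
equalized = equalize_histogram(dark)
print(brightened)
print(equalized)
```

Both steps keep the ordering of gray levels intact; they only change how the levels are distributed, which is why they are safe to apply before a recognition model.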
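As an alternative to searching for a threshold by hand, a global threshold can be estimated from the image histogram. Below is a minimal sketch of Otsu's method in plain Python: it tries every gray level and keeps the one that maximizes the between-class variance of the two resulting pixel groups. The input is a flat list of 8-bit grayscale values, and the function names and sample pixels are illustrative assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of gray levels of those pixels
    for t in range(levels):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue  # no pixels in the low class yet
        if w_high == 0:
            break     # all pixels are in the low class already
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = w_low * w_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 1 and all others to 0."""
    return [1 if p > threshold else 0 for p in pixels]


# Dark ink pixels followed by bright paper pixels (made-up values).
pixels = [10, 12, 11, 13, 200, 205, 210, 198]
t = otsu_threshold(pixels)
print(t, binarize(pixels, t))
```

This is the global variant: one threshold for the whole image. When a shadow covers only part of a document, a single threshold is often not enough, which is where the local/adaptive methods mentioned above come in.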
It is often said that machine learning is not magic, and that it suffers from the ‘garbage in, garbage out’ problem. However, in the last 10 years, deep neural networks (DNNs) have shaken existing paradigms and clichés to a certain extent. The path from AlexNet (8 layers deep, 2012) to much more powerful DNNs, for instance DenseNet-201 (201 layers deep, 2017), was very interesting to follow and study. DNNs are successfully used for image classification, image segmentation and object detection, even surpassing humans in some tasks, e.g. face verification. In recent years, there has been an ongoing trend of developing mini DNN architectures that outperform their much bigger predecessors. Nowadays, it is impossible to imagine a modern face verification system without a mighty DNN feature extractor.
DNN architectures also play a vital role in modern OCR engines. Take, for example, the well-known open-source OCR engine Tesseract, which has been in development for more than 35 years. Since version 4, Tesseract has used a long short-term memory (LSTM) network, a deep recurrent neural network particularly useful for working with sequences of data. Also, the majority of modern OCR solutions specialized for handwriting recognition, scene text recognition, etc. are DNN-based.
Sometimes Less Is More
So, why are we talking about preprocessing methods when DNNs are so mighty? Well, in some cases, less is more. There are real-life situations when it is important not to be absolutely accurate, but to be extremely fast or easily interpretable. Also, there are situations when there is not enough space on a device for colossal DNN models. And, especially important, there are situations when you need the best of both worlds.
No matter how good our training dataset is, it is not unusual to capture some images that differ significantly from the others. The most common solution to this issue is to retrain the DNN with additional, specific data. However, this procedure is time-consuming (and sometimes costly) and often hard to carry out on short deadlines. To avoid retraining DNNs every time a problematic situation is detected, we may try to help our DNNs by feeding them examples similar to what they have already ‘seen’. In other words, we may preprocess or threshold some images (e.g. outliers), or all of them, in order to feed the DNN something familiar. This trick is particularly useful with not-so-deep architectures, e.g. DNNs for OCR. Finally, the idea of this post is not to present some ‘cheap’ trick, known to many, but to indicate how some ‘old’, academic knowledge may be useful to boost the performance of a state-of-the-art model, and to underline the synergy of theory and practice ☺
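One possible way to preprocess only the outliers is sketched below. It is a toy illustration under strong assumptions, not a description of what Blinking does: an image counts as an outlier when its mean brightness is far from a target value (standing in for the typical brightness of the training data), and outliers are pulled back with a gamma correction. The names `match_brightness`, `target_mean` and `tolerance`, as well as their default values, are hypothetical.

```python
import math


def mean_brightness(image):
    """Average gray level of an image given as a list of rows."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def match_brightness(image, target_mean=128.0, tolerance=40.0):
    """Gamma-correct the image only if its brightness is an outlier.

    Images whose mean is within `tolerance` of `target_mean` are returned
    unchanged. For the rest, gamma is chosen so that the current mean gray
    level is mapped onto `target_mean`; the mean of the corrected image is
    then only approximately equal to the target.
    """
    mean = mean_brightness(image)
    if abs(mean - target_mean) <= tolerance or mean <= 0 or mean >= 255:
        return image
    gamma = math.log(target_mean / 255) / math.log(mean / 255)
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


# Made-up examples: one underexposed image and one ordinary image.
dark = [[20, 30],
        [40, 50]]
normal = [[120, 130],
          [125, 135]]

fixed = match_brightness(dark)
print(mean_brightness(dark), mean_brightness(fixed))
print(match_brightness(normal) == normal)
```

The appeal of this kind of step is that it is cheap, easy to interpret, and leaves the trained model untouched.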
References (don’t be afraid to take a peek)
- Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., & Shafait, F. (2013). High-performance OCR for printed English and Fraktur using LSTM networks. In 2013 12th International Conference on Document Analysis and Recognition (pp. 683-687).
- Chen, T., Yin, W., Zhou, X. S., Comaniciu, D., & Huang, T. S. (2006). Total variation models for variable lighting face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1519-1524.
- Gross, R., & Brajovic, V. (2003). An image preprocessing algorithm for illumination invariant face recognition. In Proceedings of the International Conference on Audio-and Video-Based Biometric Person Authentication (pp. 10-18).
- Gupta, M. R., Jacobson, N. P., & Garcia, E. K. (2007). OCR binarization and image pre-processing for searching historical documents. Pattern Recognition, 40(2), 389-397.
- Huang, Z. K., & Chau, K. W. (2008). A new image thresholding method based on Gaussian mixture model. Applied Mathematics and Computation, 205(2), 899-907.
- Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
- Smith, R. (2007). An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) (vol. 2, pp. 629-633).
- Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. Advances in Neural Information Processing Systems, 27.
- Tan, X., & Triggs, B. (2010). Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing, 19(6), 1635-1650.
- Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12113-12122).