Character Recognition - Chars74k
Devendra Pratap Yadav
Objective
The aim is to explore methods for recognizing segmented handwritten characters from image files. This problem comprises three major sub-problems: pre-processing the images, extracting relevant features from them, and training an Artificial Neural Network model to classify the data points into their respective classes. This pipeline is represented by the flowchart below.
In this report, we describe the algorithm used to solve each of the above sub-problems and illustrate the results of the work done so far.
We will be using the Chars74K image dataset (consisting of all English letters in uppercase and lowercase, along with the digits 0-9) to perform the experiments.
Image Pre-processing
The aim of this step is to obtain clear character images devoid of any noise of a fixed dimension.
● Thresholding - Although most of the images in the dataset are already binary, we explicitly binarised those that were not, using Otsu's thresholding. The foreground is represented by white pixels (value = 255).
● Image Denoising - We then apply a median filter to the binary character image to remove salt-and-pepper noise, since our algorithm relies on gradient vectors to detect character boundaries, and such noise can cause false positives and noisy features. A special case arises when the strokes in an image are very thin: the median filter can completely obliterate large chunks of them. We use RMSE as a measure to make sure this does not happen.
● Bounding Box - The third step is to obtain a bounding box containing only the main character, removing any black border around it. We then resize all of the cropped images to 32 × 32. This gives us a uniform set of images, each of the same dimension.
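A minimal Python sketch of this pipeline using OpenCV is given below; the median kernel size and the RMSE threshold are illustrative assumptions, not tuned values from our experiments.

```python
import cv2
import numpy as np

def preprocess(path, size=32):
    """Binarize, denoise, crop to the character, and resize (illustrative sketch)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu's thresholding; assumes a light character on a dark background so the
    # foreground ends up white (255). Use THRESH_BINARY_INV + THRESH_OTSU otherwise.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Median filter removes salt-and-pepper noise (3x3 kernel is an assumption)
    denoised = cv2.medianBlur(binary, 3)
    # Revert if filtering obliterated thin strokes (RMSE threshold is illustrative)
    rmse = np.sqrt(np.mean((binary.astype(float) - denoised.astype(float)) ** 2))
    if rmse > 60.0:
        denoised = binary
    # Bounding box around the foreground pixels, then resize to a uniform 32 x 32
    ys, xs = np.nonzero(denoised)
    cropped = denoised[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(cropped, (size, size), interpolation=cv2.INTER_AREA)
```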
Feature generation
We break down the entire image into sub-blocks of equal size, a process also called zoning (similar to the block decomposition used in JPEG compression). Each sub-block is treated as a separate entity with its own set of features. The features for the image as a whole are the features of its individual sub-blocks concatenated linearly.
We can observe that the rather complex-looking shape of the character as a whole becomes simple when we consider it as being made of many small individual line strokes and their intersections. An appropriately sized sub-block captures only a very small part of the character, which looks like a small line stroke. This motivates the subdivision of the image.
We use two metrics in our implementation: the number of distinct gradient orientations present in a sub-block, and which orientations they are.
To calculate these metrics, we use a process similar to the Histogram of Oriented Gradients technique. We select a particular sub-block and, for every pixel, calculate the gradient (magnitude and direction) at that point. Assuming that the gradient can be quantized into eight cardinal directions (0, 45, 90, 135, 180, 225, 270, and 315 degrees), we classify each gradient into its respective bin. After repeating this process for the entire block, we are left with 8 gradient buckets and the values they contain. These are the features that we feed into the neural network; a code sketch follows the figure reference below.
Gradient Calculation
Ref - HOG Person Detector Tutorial by Chris McCormick, http://mccormickml.com/2013/05/09/hog-person-detector-tutorial/
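The per-block binning described above can be sketched as follows. This assumes OpenCV's Sobel operator for the gradients; summing gradient magnitudes per direction bucket (rather than counting pixels) is an illustrative choice here.

```python
import cv2
import numpy as np

def block_orientation_histogram(block):
    """8-bucket orientation histogram for one sub-block (illustrative sketch)."""
    gx = cv2.Sobel(block, cv2.CV_64F, 1, 0)  # horizontal gradient
    gy = cv2.Sobel(block, cv2.CV_64F, 0, 1)  # vertical gradient
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)  # ang in [0, 360)
    # Quantize each angle to the nearest of 0, 45, ..., 315 degrees
    bins = np.round(ang / 45.0).astype(int) % 8
    # Accumulate gradient magnitude into the 8 buckets
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=8)
```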
Histogram of Oriented Gradients
This algorithm was first published in the paper "Histograms of Oriented Gradients for Human Detection" by Navneet Dalal and Bill Triggs in 2005, and has since become very popular in applications related to object detection in images. Let us walk through the algorithm and the motivation behind it.
This algorithm tackles the problem of effectively describing an image in terms of extracted features. Consider an image containing an object, perhaps a digit, like the one given below:
At first glance, it seems very difficult to objectively describe the above image, given its complex curves and edges. The intuition behind the algorithm is that if we zoom into the image and consider a very small part of it, we see short straight lines instead of curves. All similar images of a handwritten capital 'N' will yield similar sub-images, in terms of the main directions in which the lines lie. HoG assumes that there is a fixed number of cardinal directions in which a line can lie within a sub-image block. The details of the algorithm are given below.
Algorithm:
● Compute the first-order gradient at every pixel of the image using the Sobel, Prewitt, or any other suitable operator. We have used the built-in OpenCV function Sobel() to calculate the gradients; this function combines Gaussian smoothing and differentiation.
● Divide the entire image into blocks of 8 × 8 pixels. Each sub-image thus obtained is called a cell. Extra padding is added to ensure that the numbers of rows and columns are multiples of 8.
● Consider a particular cell. Compute the gradient magnitude and angle at every point of the cell. Using the HoG assumption of cardinal directions, we define 9 bins - 10, 30, 50, 70, 90, 110, 130, 150, and 170 degrees. Each gradient falls into at most two of these bins: for intermediate angles, we proportionally divide the magnitude between the two nearest bins and add the corresponding contributions. Consider a gradient with angle = 15 degrees w.r.t. the x-axis and magnitude = 40; three-fourths of its magnitude, i.e. 30, is added to bin 10 and the rest to bin 30. In this way, we get a histogram over the cardinal directions, which may have fractional values. Note that these values are independent of the length of the line in that cell. We have used the same angle bins as described in the paper by Dalal and Triggs; a code sketch of this binning follows the figure below.
Histogram of Oriented Gradients
Ref - https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/
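A minimal sketch of this proportional vote splitting for a single cell, assuming unsigned gradients (angles taken modulo 180 degrees); the wrap-around between bins 170 and 10 follows from the 20-degree bin spacing.

```python
import numpy as np

def cell_histogram(mag, ang):
    """9-bin histogram with proportional vote splitting (illustrative sketch).
    mag, ang: gradient magnitude and angle (degrees) arrays for one cell."""
    hist = np.zeros(9)  # bin centers at 10, 30, ..., 170 degrees
    for m, a in zip(mag.ravel(), ang.ravel() % 180.0):
        pos = (a - 10.0) / 20.0       # angle position between bin centers
        lo = int(np.floor(pos)) % 9   # nearer lower bin (wraps 170 <-> 10)
        hi = (lo + 1) % 9
        frac = pos - np.floor(pos)
        hist[lo] += m * (1.0 - frac)  # e.g. angle 15, magnitude 40: 30 to bin 10
        hist[hi] += m * frac          #                              10 to bin 30
    return hist
```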
● Histogram Normalization - Any change to the brightness of the image can significantly alter the magnitudes of the gradients, so it is imperative to normalize the histogram values of each cell. A simple approach is to divide each cell's histogram by its magnitude. The original paper, however, describes a different approach: it slides a square window of 2 × 2 cells over the image, concatenates the histograms of the 4 cells in the window to obtain a new vector V of size 36, normalizes V by dividing it by its magnitude, and appends the result to the final feature vector. The special thing about this approach is that the window slides by just one cell in either direction per iteration, causing an overlap of two cells. The following image demonstrates this operation.
Histogram Normalization
Ref - http://mccormickml.com/2013/05/09/hog-person-detector-tutorial/
● After repeating the above step for all the blocks, we get a feature vector of size = (total number of blocks) × (4 × number of bins). This feature vector can now be fed into an appropriate machine learning model for classification.
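A sketch of the sliding-window normalization and final assembly, under the assumption that the cell histograms have already been computed into a (rows × cols × 9) array; for a 32 × 32 image this gives 4 × 4 cells, hence 3 × 3 window positions and 324 features.

```python
import numpy as np

def hog_descriptor(cell_hists):
    """Concatenate L2-normalized 2x2 blocks of cell histograms (sketch)."""
    rows, cols, _ = cell_hists.shape
    features = []
    for i in range(rows - 1):      # window slides by one cell per step,
        for j in range(cols - 1):  # so adjacent windows overlap by two cells
            v = cell_hists[i:i + 2, j:j + 2].ravel()  # 4 cells x 9 bins = 36
            norm = np.linalg.norm(v)
            features.append(v / norm if norm > 0 else v)
    return np.concatenate(features)  # 4x4 cells -> 3x3 windows -> 324 values
```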
MATLAB provides a built-in function, extractHOGFeatures, which directly returns the feature vector of an image. OpenCV also has a built-in class named HOGDescriptor that provides this functionality.
We have used our own implementation of HoG for feature extraction. The results of our code are shown below.
Sample images with the dominant gradients in each cell (input image and HoG visualization).
Advantages of HoG:
● Other methods pose challenges such as single-response edge detection and thinning even before the machine learning stage begins. Moreover, even the best edge detection algorithms may produce varying results for very similar images, and algorithms that depend on them require further processing for edge correction. HoG, on the other hand, does not require edge detection as a pre-processing step.
● HoG is largely invariant to changes in brightness, thanks to histogram normalization.
● HoG produces fewer features than other algorithms, since it does not take into account the length of an edge within a cell. This reduces the dimensionality of the dataset and improves the performance of the machine learning algorithm.
● HoG can also be applied to other tasks such as object detection in images; in fact, it was first described for that purpose.
Neural Network
After performing the above steps, we are left with a dataset of feature vectors. The neural network will have 62 (26 + 26 + 10) or 10 output-layer nodes, depending on whether we use the Chars74K dataset containing all characters or the MNIST dataset containing only the digits. We can vary the number of hidden layers and the number of nodes in each hidden layer; this is the part where we tune our machine learning model.
The neural network will be implemented in Python using the Scikit-Learn library. The input to the model will be a text file containing the feature vectors for all of the data points.
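A minimal Scikit-Learn sketch of this setup; the file names and the hidden-layer size are hypothetical, while the 80/20 split and 15% validation fraction match the experiments described below.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Feature vectors (one row per data point) and class labels from text files
X = np.loadtxt('features.txt')  # hypothetical file names
y = np.loadtxt('labels.txt')

# 80% train / 20% test split, as used in our experiments
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# One hidden layer; its size and max_iter are illustrative, not tuned values.
# early_stopping holds out 15% of the training data for validation.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500,
                    early_stopping=True, validation_fraction=0.15)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```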
Final Results
We use various subsets of the Chars74k dataset and compare the results of using HoG features versus direct pixel values as input to the Neural Network. We use both Scikit-Learn (Python) and MATLAB to obtain the results; Scikit-Learn gives better accuracy, as it lets us tune more parameters of the Neural Network.
All models are trained using an 80% train, 20% test split, with 15% of the data points used for validation during training.
We compare the following two types of features as input to the NN:
1. The normalized [0,1] pixel values fed directly: 1024 inputs per data point.
2. A Histogram of Gradients feature vector: 324 inputs per data point.
Datasets
1. Images of Characters (extracted from real-world photos)
MATLAB

Structure of Neural Network used

| Input | Accuracy (Average of 5 tests) |
| --- | --- |
| Pixel values (scaled to [0,1]) | 53.06 % |
| Histogram of Gradients | 72.48 % |
Scikit-Learn (Python)

Structure of Neural Network used

| Input | Accuracy (Average of 5 tests) |
| --- | --- |
| Pixel values (scaled to [0,1]) | 70.67 % |
| Histogram of Gradients | 76.43 % |
2. Handwritten Characters (sampled from 55 people)
MATLAB

Structure of Neural Network used

| Input | Accuracy (Average of 5 tests) |
| --- | --- |
| Pixel values (scaled to [0,1]) | 64.76 % |
| Histogram of Gradients | 77.18 % |
Scikit-Learn (Python)

Structure of Neural Network used

| Input | Accuracy (Average of 5 tests) |
| --- | --- |
| Pixel values (scaled to [0,1]) | 75.72 % |
| Histogram of Gradients | 80.31 % |
Conclusion
We observe that the Histogram of Gradients method provides more accurate classification than direct pixel values.
Histogram of Gradients correctly identifies features for both uppercase and lowercase characters.
Some advanced models for recognition of Chars74k stand at 86% accuracy using Convolutional Neural Networks. We achieve accuracy close to 76% for the whole dataset (averaged) using a traditional Neural Network with Histogram of Gradients feature extraction.