Antimatroid, The

thoughts on computer science, electronics, mathematics


An Experiment in Optical Character Recognition


Introduction

I’ve always been interested in machine learning and how it can be applied to a number of different problems. I spent some time during college learning some of the techniques used in machine learning, but since then I’ve gotten a bit rusty. So, I revisited the subject by doing some additional research and decided to try my hand at Optical Character Recognition (OCR): the process of identifying characters within an image and producing text as output. There are a handful of traditional stages to this process: image acquisition, segmentation, recognition, and correction. Acquisition is about correcting an input image so that a segmentation process can be readily applied; segmentation identifies the characters in the image; recognition maps those visual characters to text characters; finally, correction takes the text output and rectifies any errors. The following outlines my approach to segmentation and recognition.

Segmentation

Consider the following body of text from one of my earlier posts:

The goal is to extract each of the characters in the image. The simplest way to accomplish this is to implement an algorithm that reads the image much in the same way one might read a block of text: start at the top of the text and scan downward to identify all of the rows of text, then, for each row, read all of the characters from left to right, and finally identify words based on white space. Figuring out the word boundaries is done via a simple clustering process. Assuming we have an ordered set of rectangles produced by the first two steps, we can calculate the average distance between consecutive rectangles. Once this average has been computed, we iterate over the list once more, adding rectangles to the current word when the distance between consecutive rectangles is less than the average distance and starting a new word when that distance is exceeded.
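
As a rough sketch of that word-grouping step (the rectangle type, the gap measure, and the class and method names here are my own assumptions rather than code from the project):

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

public static class WordGrouping {
    // Groups a row of character rectangles (ordered left to right) into words by
    // comparing each gap against the average gap between consecutive rectangles.
    public static List<List<Rectangle>> GroupIntoWords(IList<Rectangle> row) {
        var words = new List<List<Rectangle>>();
        if (row.Count == 0)
            return words;

        // Average horizontal distance between consecutive rectangles in the row.
        double averageGap = Enumerable.Range(1, row.Count - 1)
            .Select(i => (double)(row[i].Left - row[i - 1].Right))
            .DefaultIfEmpty(0.0)
            .Average();

        var currentWord = new List<Rectangle> { row[0] };
        for (int i = 1; i < row.Count; i++) {
            // A gap larger than the average marks a word boundary.
            if (row[i].Left - row[i - 1].Right > averageGap) {
                words.Add(currentWord);
                currentWord = new List<Rectangle>();
            }
            currentWord.Add(row[i]);
        }
        words.Add(currentWord);
        return words;
    }
}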

This segmentation approach isn’t perfect, as it assumes that we are dealing with evenly spaced characters of the same size from the same font. Of course, this isn’t always the case, and we have things like kerning and ligatures to deal with. In this example, two categories of problems arise: character combinations such as ay, ey, and ly that touch but are still separable, and combinations such as yw, rt, and ct that are not separable and end up being interpreted as a single character by this method. For the first category, I chose to divide rectangles whenever a column of pixels has a single black pixel that does not have a black pixel neighboring it to the left ((x-1, y + a) | a <- [-1, 1]). The second case isn't as clear cut, since all of the pixels neighbor one another; in principle one could implement a k-means clustering algorithm, but that assumes you know how many characters (k) you have in the image. I decided to allow the error to propagate through the system.
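
To illustrate the first heuristic, here is a minimal sketch of what scanning a candidate rectangle for split columns might look like (the bitmap representation, bounds convention, and method name are assumptions on my part):

using System.Collections.Generic;

public static class CharacterSplitting {
    // Returns the x coordinates within [left, right] x [top, bottom] at which a
    // rectangle can be divided: columns containing exactly one black pixel that has
    // no black neighbor at (x - 1, y + a) for a in [-1, 1].
    public static List<int> FindSplitColumns(bool[,] black, int left, int right, int top, int bottom) {
        var splits = new List<int>();
        for (int x = left + 1; x <= right; x++) {
            int count = 0, lastY = top;
            for (int y = top; y <= bottom; y++) {
                if (black[x, y]) { count++; lastY = y; }
            }
            if (count != 1)
                continue;

            // Check the three pixels immediately to the left of the lone black pixel.
            bool hasLeftNeighbor = false;
            for (int a = -1; a <= 1; a++) {
                int y = lastY + a;
                if (y >= top && y <= bottom && black[x - 1, y])
                    hasLeftNeighbor = true;
            }
            if (!hasLeftNeighbor)
                splits.Add(x);
        }
        return splits;
    }
}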

Recognition

Starting out, I knew that I wanted to use an artificial neural network (ANN), so I spent some time reading Russell and Norvig’s “Artificial Intelligence: A Modern Approach,” but felt that it wasn’t as comprehensive as I wanted, so I also read MacKay’s “Information Theory, Inference and Learning Algorithms,” which was more in tune with what I had in mind. I also came across a series (1, 2, 3) of PDF files hosted at Aberdeen’s Robert Gordon University that provided a more practical view of applying neural networks to pattern recognition.

A little background on ANNs: the general idea is that an individual neuron aggregates the weighted inputs from other neurons and then outputs a signal. The magnitude of the signal is determined by a function of the aggregated value called the activation function. Neurons are organized into layers: typically an input layer, one or more hidden layers, and finally an output layer. Values from the input layer feed into the hidden layer, those outputs feed into the next layer, and so on, until all of the values have gone through the output layer. The process of getting the network to map an input to an output is accomplished through training, in this case a method known as backpropagation. Given an input and an expected output, the input is fed through the network and produces an output. The difference between the two output vectors is the error, which then needs to be fed backward through the network, updating each node’s input weights such that the net error of the system is reduced. The method is effectively a gradient descent algorithm. I recommend reading the aforementioned references for details on how it all works. Following is my implementation of the backpropagation algorithm:

using System;
using System.Linq;
using Library.Mathematics;

namespace OCRProject.Recognizers.NeuralNetworks {
    // A single neuron: applies the activation function to the dot product of its
    // input weights and an input vector.
    public class Neuron {
        Func<double, double> activationFunction;

        public Vector InputWeights { get; set; }

        public Neuron(Func<double, double> activationFunction) {
            this.activationFunction = activationFunction;
        }

        public double Output(Vector input) {
            return activationFunction(InputWeights.dot(input));
        }
    }

    // A feed-forward network with one hidden layer, trained by backpropagation.
    public class NeuralNetwork {
        private Neuron[] hiddenLayer, outputLayer;

	...

        // Feeds an input vector forward through the hidden layer and then the output layer.
        public Vector Output(Vector input) {
            Vector hiddenLayerOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            return new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hiddenLayerOutput));
        }

        // Performs a single backpropagation step for one (input, desiredOutput) pair and
        // returns the output layer error terms. The delta expressions assume a logistic
        // sigmoid activation function.
        public Vector Train(Vector input, Vector desiredOutput, double learningRate) {
            // Forward pass.
            Vector hOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            Vector oOutput = new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hOutput));

            // Output layer delta: o * (1 - o) * (desired - o).
            Vector oError = new Vector(
                oOutput.Dimension,
                (i) => oOutput[i] * (1 - oOutput[i]) * (desiredOutput[i] - oOutput[i])
            );

            // Hidden layer delta: h * (1 - h) * sum of output deltas weighted by the
            // connections to each output node, computed before those weights are updated.
            Vector hError = new Vector(
                hiddenLayer.Length,
                (i) => hOutput[i] * (1 - hOutput[i]) * oError.dot(new Vector(oError.Dimension, (j) => outputLayer[j].InputWeights[i]))
            );

            // Gradient descent update of the output layer weights.
            for (int n = 0; n < outputLayer.Length; n++) {
                outputLayer[n].InputWeights = new Vector(
                    hiddenLayer.Length,
                    (i) => outputLayer[n].InputWeights[i] + learningRate * oError[n] * hOutput[i]
                );
            }

            // Gradient descent update of the hidden layer weights.
            for (int n = 0; n < hiddenLayer.Length; n++) {
                hiddenLayer[n].InputWeights = new Vector(
                    input.Dimension,
                    (i) => hiddenLayer[n].InputWeights[i] + learningRate * hError[n] * input[i]
                );
            }

            return oError;
        }
    }
}
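
The delta terms o(1 - o)(desired - o) and h(1 - h)(...) above are the ones that fall out of a logistic sigmoid activation, so a neuron built from these classes would typically be wired up with something along the lines of:

// Logistic sigmoid activation assumed by the delta terms in Train.
Func<double, double> sigmoid = (x) => 1.0 / (1.0 + Math.Exp(-x));
Neuron neuron = new Neuron(sigmoid);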

In terms of how all of this applies to OCR, I started out with all visible characters and produced a collection of 16×16 images, each containing a single character centered in the image. Each image was then mapped to a 256-dimensional input vector, with its corresponding character mapped to an 8-dimensional vector representing the expected output. The question that remains is how many hidden layer units should be used. To figure this out, I conducted a small experiment:

Hidden Units    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Minimum (%)     0   4   7  21  36  48  69  80  87  85  92  88  92  89  93  93
Average (%)     1   4  12  28  44  56  72  83  89  90  94  93  94  93  95  95
Maximum (%)     1   6  15  31  51  62  75  87  91  93  99  96  98  95  98  99

Each run consisted of 10 trials, with each trial taking 5,000 iterations to train the network. (Normally, one would measure the mean squared error of the network and halt the training process once it is sufficiently small.) After doing this testing, I found that ANNs with 11 hidden units appeared to give the highest accuracy with the fewest units. Given the original text that was used, the following text was produced:

As expected, the category of errors identified earlier (character combinations such as ff, rt, and ct) was not segmented correctly and consequently not recognized correctly.
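
For reference, here is a rough sketch of how an individual training example might be encoded and how training could be halted on mean squared error. The glyph-to-vector conversion, the bit encoding of the character, the learning rate, and the error threshold are my own assumptions rather than the project’s actual code:

using System;
using System.Collections.Generic;
using Library.Mathematics;
using OCRProject.Recognizers.NeuralNetworks;

public static class TrainingSketch {
    // Maps a 16x16 glyph (true = black pixel) to a 256-dimensional input vector.
    public static Vector Encode(bool[,] glyph) {
        return new Vector(256, (i) => glyph[i % 16, i / 16] ? 1.0 : 0.0);
    }

    // Maps a character to an 8-dimensional target vector, one component per bit of its code.
    public static Vector Encode(char c) {
        return new Vector(8, (i) => ((c >> i) & 1) == 1 ? 1.0 : 0.0);
    }

    // Trains the network on a set of (glyph, character) examples until the mean squared
    // error is sufficiently small or an iteration cap is reached.
    public static void Train(NeuralNetwork network, IList<Tuple<bool[,], char>> examples) {
        const double learningRate = 0.5, targetError = 1e-3;
        const int maxIterations = 5000;

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double meanSquaredError = 0.0;
            foreach (var example in examples) {
                Vector input = Encode(example.Item1);
                Vector desired = Encode(example.Item2);
                network.Train(input, desired, learningRate);

                // Measure the squared error against the updated weights.
                Vector output = network.Output(input);
                Vector difference = new Vector(desired.Dimension, (i) => desired[i] - output[i]);
                meanSquaredError += difference.dot(difference) / difference.Dimension;
            }
            meanSquaredError /= examples.Count;

            if (meanSquaredError < targetError)
                break;
        }
    }
}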

Wrap-up

There is a lot of additional work that could be thrown at this project. In the future, it’d be good to modify the solution to accept any font at any size, to add support for images containing scanned or photographed text rather than computer-generated images, and to add some error correction on the output to flag parts of the text that require review. I suspect that I will continue down this road and start investigating how these methods can be applied to general computer vision for some planned robotics projects.
