Antimatroid, The

thoughts on computer science, electronics, mathematics

Archive for the ‘Artificial Intelligence’ Category

Category Recognition of Golden and Silver Age Comic Book Covers

leave a comment »



For a while now, I’ve been wanting to work on a computer vision project and decided for my next research focused project that I would learn some image processing and machine learning algorithms in order to build a system that would classify the visual contents of images, a category recognizer. Over the course of the summer I researched several techniques and built the system I had envisioned. The end result is by no means state of the art, but satisfactory for four months of on-and-off development and research. The following post includes my notes on the techniques and algorithms that were used in the project followed by a summary of the system and its performance against a comic book data set that was produced during development.

Subject Matter

The original subject matter of this project were paintings from the 1890s done in the Cloisonnism art style. Artists of the style are exemplified by Emile Bernard, Paul Gaugin and Paul Serusier. The style is characterized by large regions of flat colors outlined by dark lines; characteristics that would be easy to work with using established image processing techniques. During development, it became evident that no one approach would work well with these images. As an alternative, I decided to work with Golden and Silver Age comic book covers from the 1940s to 1960s which also typified this art style. Many of the comic books were drawn by the same individuals such as Jack Kirby, Joe Shuster and Bob Kane. As an added benefit, there are thousands of comic book covers available online compared to the dozens of Cloisonnism paintings.

Image Processing


An image is a function, I : \mathbb{Z}^2 \to \mathbb{Z}^3, where each input vector, \vec{x}, represents an image coordinate and each output vector, \vec{y}, represents the red, blue and green (RGB) channels, \vec{y}_c, of an image. Individual input values are bound between zero and the width, w, or height, h, of the image and output values are between zero and 255. Each output vector represents a unique color in RGB space giving rise to 2^{24} possible colors. Below is a basic sample image broken down into to its individual channels.

From left to right: original image, red channel, blue channel and green channel.
Original image: Action Comics 1 (1938). Art by Joe Shuster. Copyright © 1938 DC Comics, Inc.

Like any other vector field, transformations can be applied to the image to yield a new image, \hat{I}. In image processing, these transformations are referred to as image filters and come in three varieties of point-based, neighbor-based and image-based filters. As the names suggest, point-based filters map single output vectors to a single output vector, neighbor-based filters map neighboring output vectors to a single output vector, and image-based filters map the whole image and a single or neighboring set of output vectors to a single output vector.

There are many different instances of these types of filters, but only those used in this project are discussed below. Computational complexity and efficient algorithms for each type of filter are also discussed where appropriate.

Point-based Filters

Point-based filters, f : \mathbb{Z}^3 \to \mathbb{Z}^3 , map an output vector to a new output vector in the form \hat{I}(\vec{x}) = f(I(\vec{x})). Application of a point-based filter is done in quadratic time with respect to the dimensions of the image \mathcal{O}(N^2).

Grayscale Projection

It is helpful to collapse the RGB channels of an image down to a single channel for the purpose of simplifying filter results. This can be done by using a filter of the form f(\vec{y})_c = \frac{ \lVert \vec{y} \rVert_2 }{ \sqrt{3} }. Alternatively one can use a filter of the form f(\vec{y})_c = (0.2126, 0.7152, 0.0722)^T \cdot \vec{y} to represent the luminescence of the output vector.


A threshold filter serves as a way to accentuate all values in the image greater than or equal to a threshold, \gamma, or to attenuate all those values less than the threshold.

The first variety is the step threshold filter, f(\vec{y})_c = \begin{cases} 255 & \vec{y}_{c} \ge \gamma \\ 0 & \text{otherwise} \end{cases}, which exhibits an ideal threshold cutoff after the threshold value.

The second variety is a logistic threshold filter, \displaystyle f(\vec{y})_c = \frac{255}{ 1.0 + \exp{\sigma(\gamma - \vec{y}_c)} }, with an additional parameter, \sigma, allowing for wiggle room about the threshold yielding a tapered step function as \sigma increases in size.

From left to right: original image, step threshold filter (\gamma = 0.5), logistic threshold filter (\gamma = 0.5, \sigma = 4), and logistic threshold filter (\gamma = 0.5, \sigma = 16).
Original image: The Amazing Spider-Man 1 (1963). Art by Jack Kirby & Steve Ditko. Copyright © 1963 Non-Pareil Publishing Corp.

Neighbor-based Filters

All neighbor-based filters take the output vectors neighboring an input vector to calculate a new output vector value. How the neighboring output vectors should be aggregated together is given by a kernel image, K, and the computation is represented as a two-dimensional, discrete convolution.

\hat{I}_c = \displaystyle (I \star K)(u,v)_c = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} I(i,j)_c K(u-i,v-j)_c

Neighbor-based filters can be applied naïvely in quartic time as a function of the image and kernel dimensions, \mathcal{O}(N^4). However, a more efficient implementation allows for \mathcal{O}(N^2 \log_2 N) time by way of the Discrete Fourier Transform.

\displaystyle \mathcal{F}(f)(x) = \sum_{n=0}^{N-1} f(n) \exp{ -2 \pi i \frac{n}{N} x}

The Discrete Fourier Transform is a way of converting a signal residing in the spatial domain into a signal in the frequency domain by aggregating waveforms of varying frequencies where each waveform is amplified by its corresponding value in the input signal. The Inverse Discrete Fourier Transform maps a frequency domain signal back to the spatial domain.

\displaystyle \mathcal{F}^{-}(f)(x) = \frac{1}{N} \sum_{n=0}^{N-1} f(n) \exp{ 2 \pi i \frac{n}{N} x}

Applying the Discrete Fourier Transform to a convolution, \mathcal{F}(f \star g), comes with the convenient property that the transformed convolution can be rewritten as the product of the transformed functions, \mathcal{F}(f) \mathcal{F}(g), by way of the Convolution Theorem.

The improved time complexity is achieved by using a divide a conquer algorithm known as the Fast Fourier Transform which takes advantage of the Danielson-Lanczos Lemma which states that the Discrete Fourier Transform of a signal can be calculated by splitting the signal into two equal sized signals and computing their Discrete Fourier Transform.

\displaystyle \mathcal{F}(f)(x) =  \sum_{n=0}^{\frac{N}{2} - 1} f(2n) \exp{ -2 \pi i \frac{2n}{N} x} + \sum_{n=0}^{\frac{N}{2} - 1} f(2n+1) \exp{ -2 \pi i \frac{2n+1}{N} x }

\displaystyle \mathcal{F}(f)(x) =  \sum_{n=0}^{\frac{N}{2} - 1} f(2n) \exp{ -2 \pi i \frac{n}{N / 2} x} + \exp{ -2 \pi i \frac{x}{N}  } \sum_{n=0}^{\frac{N}{2} - 1} f(2n+1) \exp{ -2 \pi i \frac{n}{ N / 2 } x }

\displaystyle \mathcal{F}(f)(x) = \mathcal{F}(f_{even})(x) + \mathcal{F}(f_{odd})(x) \exp{ -2 \pi i \frac{x}{N}  }

For the purposes of image processing, we use the two-dimensional Discrete and Inverse Discrete Fourier Transform.

\displaystyle \mathcal{F}(f)(u, v) = \frac{1}{w h} \sum_{m = 0}^{w - 1} \sum_{n = 0}^{h-1} f(m,n) \exp{-2 \pi i \left (\frac{m u}{h} + \frac{nv}{w} \right) }

The expression can be rearranged to be the Discrete Fourier Transform of each column in the image and then computing the resulting Discrete Fourier Transform of those results to obtain the full two-dimensional Discrete Fourier Transform.

\displaystyle \mathcal{F}(f)(u, v) = \frac{1}{w} \sum_{m = 0}^{w-1} \left (  \frac{1}{h} \sum_{n = 0}^{h-1} f(m,n) \exp{-\frac{2 \pi i n v}{h}}  \right) \exp{-\frac{2 \pi i m u}{w}}

As a result, we can extend the Fast Fourier Transform in one dimension easily into two dimensions producing a much more efficient time complexity.

Weighted Means: Gaussian and Inverse Distance

Weighted mean filters are used to modify the morphology of an image by averaging neighboring output vectors together according to some scheme.

A Gaussian filter is used to blur an image by using the Gaussian distribution with standard deviation, \sigma, as a kernel.

\displaystyle K(u, v)_c = \frac{1}{2 \pi \sigma^2} \exp{-\frac{u^2+v^2}{2 \sigma^2} }

The inverse distance filter calculates how far the neighboring output vectors are with respect to the new output vector being calculated. Each result is also scaled by the parameter, \gamma, allowing for contrast adjustments.

\displaystyle K(u, v)_c = \begin{cases}  \gamma  \lVert (u-i, v-j) \rVert^{-}  & (u,v) \neq (i,j)  \\ 0 & \text{otherwise} \end{cases}


A Laplacian filter detects changes in an image and can be used for sharpening and edge detection. Much like in calculus of a single variable, the slope of a surface can be calculated by the Gradient operator, \displaystyle \nabla = \left ( \frac{\partial}{\partial x}, \frac{\partial}{\partial y} \right ). Since it is easier to work with a scalar than a vector, the magnitude of the gradient is given by the Laplacian operator, \displaystyle \Delta = \nabla \cdot \nabla = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}.

Since an image is a discrete function, the Laplacian operator needs to be approximated numerically using a central difference. h represents the spacing between successive samples of the underlying function. Since the finest resolution that can be achieved in an image is an individual displacement, h = 1.

\displaystyle \Delta I \approx  \frac{I(x + h, y) + I(x, y + h) - 4 I(x,y) - I(x - h, y) + I(x, y - h)}{h^2}

\displaystyle \Delta I \approx I \star K = I \star \frac{1}{h^2} \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}

From left to right: original image, Gaussian filter (\sigma = 4.0), inverse distance filter (\gamma = 2.0) and Laplacian filter.
Original image: Batman 1 (1940). Art by Bob Kane. Copyright © 1940 DC Comics, Inc.

Image-based Filters

Image-based filters calculate some information about the contents of the image and then use that information to generate the appropriate point-based and neighbor based filters.


The normalization process computes the minimum, \vec{y}_{c}^{\text{min}} and maximum, \vec{y}_{c}^{\text{max}} values of each channel and linearly maps all values between those extrema to new values between the possible channel extrema of c^{\text{min}} = 0 and c^{\text{max}} = 255.

\displaystyle \hat{I}(\vec{x})_c =  \frac{I(\vec{x})_c - \vec{y}^{\text{min}}_c}{\vec{y}^{\text{max}}_c - \vec{y}^{\text{max}}_c} (c^\text{max} - c^\text{min}) + c^\text{min}

This particular image-based filter can be applied in quadratic time, \mathcal{O}(N^2), to calculate the extrema of the image and apply the linear map.

From left to right: original image, normalized image and accompanying histograms.
Original image: Daredevil 1 (1964). Art by Jack Kirby. Copyright © 1964 Marvel Comics.

Edge Detection

Edge detection is the process of identifying changes (e.g., texture, color, luminance and so on) in an image. As alluded to in the image processing section, the Laplacian filter is central to detecting edges within an image. As a result A sequence of filters is used before and after a Laplacian filter to produce a detector that consistently segments comic book covers. The following sequence of filters was used.

  1. Grayscale Projection – Since all logical components of a comic book cover are separated by inked lines, it is permissible to ignore the full set of RGB channel information and collapse the image down to a grayscale image.
  2. Normalization – It is conceivable that the input image has poor contrast and brightness. To ensure that the full range of luminescence values are presented, the image is normalized.
  3. Gaussian (\sigma = 1.5) – An image may have some degree of noise superimposed on the image. To reduce the noise, the image is blurred using a Gaussian filter with a standard deviation of 1.5. This is enough to smooth out the image without distorting finer image detail.
  4. Laplacian – Once the image has been prepared, its edges are calculated using the Laplacian filter.
  5. Normalization – Most of the changes in the image may be subtle and need to make sure that all edge information is accentuated as much as possible by applying a normalization filter.
  6. Step Threshold (\gamma = 0.05) – Since a partial edge isn’t particularly useful information, any edge RGB value less than 12.75 is attenuated to zero and all other values accentuated to 255.
  7. Inverse Distance (\gamma = 1.5) – It is possible that during the threshold process that discontinuities were introduced into some of the edges. To mitigate this impact, an inverse distance filter is used to inflate existing edges and intensify the result with a gain of 1.5.
From left to right, top to bottom: original image, grayscale projection, normalization, Gaussian (\sigma = 1.5), Laplacian, normalization, step threshold (\gamma = 0.05), inverse distance (\gamma = 1.5).
Original image: Captain America 111 (1964). Art by Jim Steranko. Copyright © 1968 Marvel Comics.

The complete edge detection process takes computational complexity of \mathcal{O}(N^2 \log_2 N) due to the neighbor-based filters used to eliminate noise and smooth edge discontinuities.


With the edge image it is possible to segment the image into its visual components. This is achieved by doing a flood fill on the image and using the edge image as the boundaries for the fill. Once the fill runs out of points to flood, the segment is complete and the next remaining point in the image is considered. To reduce the number of minuscule segments, only those segments representing 0.1 \% of the image are included.

From left to right: original image, edge image, sample overlaid segments and sample segments.
Original image: Fantastic Four Vol 1 59 (1967). Art by Jack Kirby. Copyright © 1967 Marvel Comics.

Machine Learning


The task of classification is to identify decision boundaries separating all of the classification within the data set. Such data sets can be linearly or non-linearly separable and as a result, classifiers were developed to solve the linear case and then adapted to deal with the more complicated non-linear case. While there are a number of classifiers, only the K-Nearest Neighbor and Support Vector Machine classifiers were researched and implemented in this project.

K-Nearest Neighbor

The K-Nearest Neighbor classifier is an online classifier which operates under the assumption that a yet to be classified vector is most likely to be the same classification as those k training vectors which are closest to the vector based on a distance measure, d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}.

Distance can be measured in a variety of ways for arbitrary vectors, \vec{x}, \vec{y} \in \mathbb{R}^n, residing in some real space. The most common of which are specialized cases of the Minkowski distance.

\displaystyle d_{p}(\vec{x},\vec{y}) = \left ( \sum_{i = 0}^{n} \lvert \vec{x}_{i} - \vec{y}_{i} \rvert^{p} \right )^{ \frac{1}{p} }

The Manhattan distance, d_{1}(\vec{x},\vec{y}), yields the distance traveled along a grid between two vectors (hence a name in reference to the New York City borough). The Euclidean distance, d_{2}(\vec{x}, \vec{y}), gives the distance between the vectors in the usual familiar sense. The last specialized cased considered is the Chebyshev distance, d_{\infty}(\vec{x},\vec{y}), which gives the maximum distance between any one dimension of the two vectors.

Two factors affect the efficacy of the algorithm. The first is the dimension of the data, n, and the size of the train data set, N. As the training data set increases with size, there are more vectors which a test vector must be compared against. As a result, an efficient means of searching the training set must be used to yield satisfactory performance. This can be achieved by using kd-Trees which give \mathcal{O}(\log N) search performance or branch and bound methods giving similar performance. As the dimensionality of the dataset increases, the efficacy of kd-Trees diminishes to a near linear search of the training data set due to the “curse of dimensionality.”

From left to right: Test point (green triangle) near two categorizes, identification of nearest neighbors, and resolved category.

Support Vector Machine

The Support Vector Machine classifier is an offline linear, binary classifier which operates under the assumption that a training set, (\vec{x}, y)^{(i)} \in \mathcal{D}, consists of linearly separable classifications, y \in \lbrace -1, +1 \rbrace , of data, \vec{x} \in \mathbb{R}^n, by some optimal hyperplane of the form \langle \vec{w}, \vec{x} \rangle + b = 0. Where \langle \cdot, \cdot \rangle is the inner product, \vec{w} \in \mathbb{R}^n and b \in \mathbb{R}. When \langle \vec{w}, \vec{x} \rangle + b \ge 1, then the classification +1 is presented and when \langle \vec{w}, \vec{x} \rangle + b \le -1, the classification -1 is presented.

From left to right: Training data of two categorizes, optimally separating hyperplane, and resolved categorizes in \mathbb{R}^2.

The hyperplane is padded by two hyperplanes separated by an equal distance to the nearest training examples of each classification. The span between the supporting hyper planes is the margin. The goal then is to pick a hyperplane which provides the largest margin between the two separable classifications. The margin between the supporting hyperplanes is given by \displaystyle \frac{2}{\lVert \vec{w} \rVert}. Given the demarcation criteria, the maximum margin will also be subject to the constraint that all training examples satisfy y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1 \ge 0. As a result of the objective function and accompanying linear constraint, the problem is stated in terms of its native primal Quadratic Programming form.

\min \mathcal{L}_P(\vec{w}) = \frac{1}{2} \langle \vec{w}, \vec{w} \rangle subject to y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1 \ge 0 \forall (\vec{x}, y)^{(i)} \in \mathcal{D}

To find the optimal parameters, it is easier to translate the problem into a dual form by applying the technique of Lagrange Multipliers. The technique takes an objective function, f, and constraint functions, g^{(i)}, and yields a new function \mathcal{L}(\vec{x}, \vec{\alpha}) = f(\vec{x}) + \sum \vec{\alpha}_i g(\vec{x})^{(i)} to be optimized subject to the added constraint \vec{\alpha}_i \ge 0 \forall i.

\displaystyle \max_{\vec{\alpha}} \mathcal{L}(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} \langle \vec{w}, \vec{w} \rangle - \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_{i} (y^{(i)} (\langle \vec{w}, \vec{x}^{(i)} \rangle + b) - 1) \vec{\alpha} \in \mathbb{R}^{\lvert \mathcal{D} \rvert } subject to \vec{\alpha}_{i} \ge 0 \forall i

The next step is to differentiate the objective function with respect to the parameters to determine the optimal solution. Since the function is concave, the results will yield the desired maximum constraints.

\displaystyle \frac{\partial \mathcal{L}}{\partial b} = 0 \implies \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i y^{(i)} = 0 \displaystyle \frac{\partial \mathcal{L}}{\partial \vec{w}} = 0 \implies \vec{w} = \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i y^{(i)} \vec{x}^{(i)}

As a result the dual problem can be written as the following:

\displaystyle \max \mathcal{L}_D(\vec{\alpha}) = \frac{1}{2} \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i \vec{\alpha}_j y^{(i)} y^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle - \sum_{k=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_k subject to \vec{\alpha}_{i} \ge 0 \forall i, \displaystyle \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_j y^{(j)} = 0

Handling of non-linearly separable data

In the event that the data is not linearly separable, then an additional parameter, C, is added as a penalty factor for those values that reside on the wrong side of the hyperplane. The derivation for the quadratic program is identical to the one presented above with the exception that the lagrange multipliers now have an upper boundary 0 \le \vec{\alpha}_i \le C \forall i.

Non-linear classification

By way of Mercer’s Theorem, the linear Support Vector Machine can be modified to allow for non-linear classification through the introduction of symmetric, positive semidefinite kernel functions, \Phi : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}. The idea being that if the data is not linearly separable in its present dimensional space that by mapping it to a higher dimensional space that the data may become linearly separable by some higher dimensional hyperplane. The benefit of a kernel function is that the higher dimensional vector need not be computed explicitly. This “kernel trick” allows for all inner products in the dual representation to be substituted with a kernel.

\displaystyle \max \mathcal{L}_D(\vec{\alpha}) = \frac{1}{2} \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_i \vec{\alpha}_j y^{(i)} y^{(j)} \Phi(\vec{x}^{(i)}, \vec{x}^{(j)}) - \sum_{k=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_k subject to 0 \le \vec{\alpha}_{i} \le C \forall i, \displaystyle \sum_{j=0}^{\lvert \mathcal{D} \rvert-1} \vec{\alpha}_j y^{(j)} = 0

And the decision hyperplane function then becomes:

\displaystyle f(\vec{x}) = \sum_{i=0}^{\lvert \mathcal{D} \rvert-1} \alpha_i y^{(i)} \Phi (\vec{x}^{(i)}, \vec{x}) + b

The following are some typical kernels:

  • Linear – \Phi(\vec{x}, \vec{y}) = \langle \vec{x}, \vec{y} \rangle
  • Polynomial – \Phi(\vec{x}, \vec{y}) = (\gamma \langle \vec{x}, \vec{y} \rangle + r)^d \gamma > 0
  • Radial basis function – \Phi(\vec{x}, \vec{y}) = \exp{-\gamma \langle \vec{x} - \vec{y}, \vec{x} - \vec{y} \rangle} \gamma > 0
  • Sigmoid – \Phi(\vec{x}, \vec{y}) = \tanh (\gamma \langle \vec{x}, \vec{y} \rangle + r)

From a practical point of view, only the linear and radial basis function kernels from this list should be considered since the polynomial kernel has too many parameters to optimize and the sigmoid kernel does not satisfy the positive semidefinite kernel matrix requirement of Mercer’s Theorem.

Algorithmic details

The Support Vector Machine classifier can be implemented using a quadratic programming solver or by incremental descent algorithms. Both methods work, but are difficult to implement and expensive to procure. An alternative is the Sequential Minimal Optimization algorithm developed by John Platt at Microsoft Research. The algorithm works by analytically solving the dual problem for the case of two training examples then iterating over all of the lagrange multipliers verifying that the constraints are satisfied. For those that are not, the algorithm computes new lagrange multiplier values. The full details of the algorithm can be found in Platt’s paper.

The time complexity of the algorithm is quadratic with respect to the number of training samples and support vectors \mathcal{O}(N^2).

The time complexity of evaluating the decision function is linear with respect to the number of support vectors \mathcal{O}(N).

Multiclass Classification

The classification methods presented in the previous section are utilized as binary classifiers. These classifiers can be used to classify multiple classifications by employing a one-vs-all or all-vs-all approach. In the former a single classification is separated from the remaining classifications to produce N classifiers for the N classifications. Each classifier is then used to evaluate a vector and the classifier with the highest confidence is then used to declare the classification.

In the latter, a single classification is compared individually to each other classification resulting in \frac{N(N - 1)}{2} classifiers. All of the classifiers are then evaluated against the test vector and the classification with the greatest consensus from the classifiers is declared the classification of the test vector.

Both methods have their place. The benefit of a one-vs-all approach is that there are fewer classifiers to maintain. However, training a single classifier on a complete data set is time consuming and can give deceptive performance measures. All-vs-all does result in more classifiers, but it also provides for faster training which can be easily parallelized on a single machine and distributed to machines on a network.

Classifier Evaluation

Individual classifiers are evaluated by training the classifier against a data set and then determining how many correct and incorrect classifications were produced. This evaluation produces a confusion matrix.

Predicted Classification
Positive Negatives Total
Actual Classification Positive (TP) True Positive (FN) False Negative (AP) Actual Positives
Negatives (FP) False Positive (TN) True Negative (AN) Actual Negatives
Total (PP) Predicted Positives (PN) Predicted Negatives (N) Examples
Confusion matrix defintion and associated terms.

The confusion matrix is used to calculate a number of values which are used to evaluate the performance of the classifier. The first of which is the accuracy and error of the classifier. Accuracy measures the number of instances where the actual and predicted classifications matched up and the error for when they do not.

\displaystyle \text{Accuracy} = \frac{TP + TN}{N} \displaystyle \text{Error} = \frac{FP + FN}{N}

Since we should expect to get different results each time we evaluate a classifier, the values that we obtain above are sample estimates of the true values that are expected. Given enough trails and measurements, it is possible to determine empirically what the true values actually are. However, this is time consuming and it is instead easier to use confidence intervals to determine what interval of values a measurement is mostly likely to fall into.

Training and Testing

Each of the classifiers presented have some number of parameters that must be determined. The parameters can be selected by having some prior knowledge or by exploring the parameter space and determining which parameters yield optimal performance. This is done by performing a simple grid search over the parameter space and evaluating and attempting to minimize the error.

K-folds cross-validation is used at each grid location to produce a reliable measure of the error. The idea is that a data set is split into k disjoint sets. The first set is used as a validation set and the remaining k - 1 sets are used in unison as the training data set for the classifier. This process is done on the next set and so on until all k sets have been used as a validation set.



The system was implemented in C# 4.0 on top of the Microsoft .NET Framework. The user interface was written by hand using the WinForms library. No other third-party libraries or frameworks were used. When possible, all algorithms were parallelized to take advantage of multi-core capabilities to improve processing times.


The system consists of two modes of operation: training and production. In training, a human classifier labels image segments with an appropriate classification. New image segments are then taken into consideration during the training of machine learning algorithms. Those algorithms producing the lowest error for a given classification are then used in production mode. During production, a user submits an image and each image segment is then evaluated against the available classifiers. Those image segments are then presented to the user with the most likely classification. These two modes along with their workflows and components are illustrated in the following diagram.

System inputs, components and outputs used for training and production workflows.

Training Mode

Data Set Construction

The user interface of the system allows users to add an image segment to a local data set of images. Once added, the image is then processed to yield image segments. The user can then label an image segment by editing the segment and moving on to the next image segment. This allows for easy and efficient human classification of data. If the user does not wish to keep the image, he or she may remove the image from the data set as well.

Data set construction tab.
Input image: Superman 14 (1942). Art by Fred Ray. Copyright © 1942 DC Comics, Inc.

Data Set Cleaning

During the construction phase, errors may be introduced into the data set typically in the case of typos or forgetting which segment was currently being edited. The data set is cleaned by listing out all available classifications and presenting the user with all available segments associated with that classification. The user can then review the image segment as it was identified in the source image. If the user does not wish to keep the classification, he or she may remove the image from the data set as well.

Data set cleaning tab.
Input image: Action Comics 62 (1943). Art by Jack Burnley. Copyright © 1943 DC Comics, Inc.

Data Set Statistics

The data set consists of 496 comic book covers pulled from the Cover Browser database of comic book covers. The first 62 consecutive published comic book covers where used from Action Comics, Amazing Spider-man, Batman, Captain America, Daredevil, Detective Comics, Superman, and Wonder Woman and then processed by the image processing subsystem yielding 24,369 image segments. 11,463 of these segments represented classifiable segments which were then labeled by hand over the course of two weeks; the remaining segments were then discarded.

Data Set Distribution consists of predominately clothing and text with equal representation from all comic book series sources.

In total, there were 239 classifications identified in the data set among 18 categories. Text, clothing, geography, and transportation categories accounting for 90% of the data set. Since the majority of classification were incidental, only those classifications having 50 or more image segments were considered by the application leaving a total of 38 classifications.

Classifier Evaluation

For the 38 classifications meeting the minimum criteria for classification, the K-Nearest Neighbor approach worked well in distinguishing between text classifications from other classifications and between intra-text classifications for both all-vs-all and one-vs-all schemes.

All-vs-All K-Nearest Neighbor Performance. One-vs-All K-Nearest Neighbor Performance.

The Support Vector Machine approach presented unremarkable results for both all-vs-all and one-vs-all methods. In the former, only a few pairings resulted in acceptable error rates whereas the later presented only a couple acceptable error rates.

All-vs-All Support Vector Machine Performance. One-vs-All Support Vector Machine Performance.

For both classification methods presented, the all-vs-all method yielded superior results to the one-vs-all method. In comparing the two classifier methods, the K-Nearest Neighbor seems to have done better than the Support Vector Machine approach, contrary to what was expected from literature. Both classifier methods are used in production mode.

Production Mode

Production mode allows the end user to add an image to the data set and then review the most likely classifications produced by evaluating each image segment against the available set of classifiers. The end user is then expected to review each segment and accept or reject the suggested classification. Aside from this additional functionality, production mode is nearly identical in functionality to training mode.

Production data set construction tab.
Input image: Detective Comics 36 (1940). Art by Bob Kane. Copyright © 1940 DC Comics, Inc.


The time spent on this project was well spent. I met the objectives that I laid out at the beginning of the project and now have a better understanding of the image processing algorithms and machine learning concepts from a theoretical and practical point of view.

Future Work


One issue with the existing implementation is that it over segments the image. Ideally, fewer segments would be produced that are more closely aligned with their conceptual classification. There are a number of popular alternatives to the approach taken, such as level set methods, which should be further investigated.


The approach taken to map scaled versions of the image segments to a 2^{10} space is simple to implement, but it did not assist well in the classification process. Alternative mappings such as histogram models should be evaluated in the future to decrease classification times and to determine if classification error rates can be reduced.

System User Interface

While it was disappointing to have spent so much time building a data set only to have to limit what was considered, it assisted me in building a user interface that had to be easy and fast to use. The application can certainly be developed further and adapted to allow for other data sets to be constructed, image segmentation methods to be added and additional classifications to be evaluated.

System Scalability

The system is limited now to a single machine, but to grow and handle more classifications, it would need to be modified to run on multiple machines, have a web-based user interface developed and a capable database to handle the massive amounts of data that would be required to support a data set on the scale of the complete Cover Browser’s or similar sites’ databases (e.g., 450,000 comic book covers scaled linearly would require 546 GiB of storage.) Not to mention data center considerations for overall system availability and scalability.


Aly, Mohamed. Survey on Multiclass Classification Methods. [pdf] Rep. Oct. 2011. Caltech. 24 Aug. 2012.

Asmar, Nakhle H. Partial Differential Equations: With Fourier Series and Boundary Value Problems. 2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2004. Print.

Bousquet, Olivier, Stephane Boucheron, and Gabor Lugosi. “Introduction to Statistical Learning Theory.” [pdf] Advanced Lectures on Machine Learning 2003,Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003 (2004): 169-207. 7 July 2012.

Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization [pdf]. N.p.: Cambridge UP, 2004. Web. 28 June 2012.

Burden, Richard L., and J. Douglas. Faires. Numerical Analysis. 8th ed. Belmont, CA: Thomson Brooks/Cole, 2005. Print.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. “An Empirical Evaluation of Supervised Learning in High Dimensions.” [pdf] ICML ’08 Proceedings of the 25th international conference on Machine learning (2008): 96-103. 2 May 2008. 6 June 2012.

Fukunaga, Keinosuke, and Patrenahalli M. Narendra. “A Branch and Bound Algorithm for Computing k-Nearest Neighbors.” [pdf] IEEE Transactions on Computers (1975): 750-53. 9 Jan. 2004. 27 Aug. 2012.

Gerlach, U. H. Linear Mathematics in Infinite Dimensions: Signals, Boundary Value Problems and Special Functions. Beta ed. 09 Dec. 2010. Web. 29 June 2012.

Glynn, Earl F. “Fourier Analysis and Image Processing.” [pdf] Lecture. Bioinformatics Weekly Seminar. 14 Feb. 2007. Web. 29 May 2012.

Gunn, Steve R. “Support Vector Machines for Classification and Regression” [pdf]. Working paper. 10 May 1998. University of Southampton. 6 June 2012.

Hlavac, Vaclav. “Fourier Transform, in 1D and in 2D.” [pdf] Lecture. Czech Technical University in Prague, 6 Mar. 2012. Web. 30 May 2012.

Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. [pdf] Tech. 18 May 2010. National Taiwan University. 6 June 2012.

Kibriya, Ashraf M. and Eibe Frank. “An empirical comparison of exact nearest neighbour algorithms.” [pdf] Proc 11th European Conference on Principles and Practice of Knowledge Discovery in Databases. (2007): 140-51. 27 Aug. 2012.

Marshall, A. D. “Vision Systems.” Vision Systems. Web. 29 May 2012.

Panigraphy, Rina. Nearest Neighbor Search using Kd-trees. [pdf] Tech. 4 Dec. 2006. Stanford University. 27 Aug. 2012.

Pantic, Maja. “Lecture 11-12: Evaluating Hypotheses.” [pdf] Imperial College London. 27 Aug. 2012.

Platt, John C. “Fast Training of Support Vector Machines Using Sequential Minimal Optimization.” [pdf] Advances in Kernel Methods – Support Vector Learning (1999): 185-208. Microsoft Research. Web. 29 June 2012.

Sonka, Milan, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. 2nd ed. CL-Engineering, 1998. 21 Aug. 2000. Web. 29 May 2012.

Szeliski, Richard. Computer vision: Algorithms and applications. London: Springer, 2011. Print.

Tam, Pang-Ning, Michael Steinbach, and Vipin Kumar. “Classification: Basic Concepts, Decision Trees, and Model Evaluation.” [pdf] Introduction to Data Mining. Addison-Wesley, 2005. 145-205. 24 Aug. 2012.

Vajda, Steven. Mathematical programming. Mineola, NY: Dover Publications, 2009. Print.

Welling, Max. “Support Vector Machines“. [pdf] 27 Jan. 2005. University of Toronto. 28 June 2012

Weston, Jason. “Support Vector Machine (and Statistical Learning Theory) Tutorial.” [pdf] Columbia University, New York City. 7 Nov. 2007. 28 June 2012.

Zhang, Hui, Jason E. Fritts, and Sally A. Goldman. “Image Segmentation Evaluation: A Survey of Unsupervised Methods.” [pdf] Computer Vision and Image Understanding 110 (2008): 260-80. 24 Aug. 2012.


Images in this post are used under §107(2) Limitations on exclusive rights: Fair use of Chapter 1: Subject Matter and Scope of Copyright of the of the Copyright Act of 1976 of Title 17 of the United States Code.

Minesweeper Agent

with one comment


Lately I’ve been brushing up on probability, statistics and machine learning and thought I’d play around with writing a Minesweeper agent based solely on these fields. The following is an overview of the game’s mechanics, verification of an implementation, some different approaches to writing the agent and some thoughts on the efficacy of each approach.



Minesweeper was created by Curt Johnson in the late eighties and later ported to Windows by Robert Donner while at Microsoft. With the release of Windows 3.1 in 1992, the game became a staple of the operating system and has since found its way onto multiple platforms and spawned several variants. The game has been shown to be NP-Complete, but in practice, algorithms can be developed to solve a board in a reasonable amount of time for the most common board sizes.


An agent, \mathcal{A}, is presented a n \times m grid containing M uniformly distributed mines. The agent’s objective is to expose all the empty grid locations and none of the mines. Information about the mines’ grid locations is gained by exposing empty grid locations which will indicate how many mines exist within a unit (Chebyshev) distance of the grid location. If the exposed grid location is a mine, then the player loses the game. Otherwise, once all empty locations are exposed, the player wins. \textbf{PLAY-GAME}(\mathcal{A}, n, m, M)  \newline \indent H \leftarrow \textbf{INITIALIZE-HIDDEN}(n,m,M)  \newline \indent V \leftarrow \textbf{INITIALIZE-VISIBLE}(n,m,M)  \newline  \newline \indent \textbf{do}  \newline \indent \indent (i,j) \leftarrow \mathcal{A}(V)  \newline \indent \indent \textbf{EXPOSE}(H, V, i, j)  \newline \indent \indent \textbf{if} \quad V_{i,j} = \text{*}  \newline \indent \indent \indent \textbf{return} \quad \textit{Failure}  \newline \indent \textbf{while} \quad M \neq \lvert \textbf{U-LOCATIONS}(V) \rvert  \newline  \newline \indent \textbf{return} \quad \textit{Success}
The board consists of hidden and visible states. To represent the hidden, H, and visible state, V, of the board, two character matrices of dimension n \times m are used.

Characters ‘0’-‘8’ represent the number of neighboring mines, character ‘U’ to represent an unexposed grid location and character ‘*’ for a mine.

Neighbors of a grid location (i,j) is the set of grid locations such that 1 \le \lVert (u,v) - (i,j) \rVert_{\infty} \le r. By default, r = 1.

\textbf{INITIALIZE-VISIBLE}(n, m)   \newline \indent \textbf{return} \quad \textbf{INITIALIZE}(n,m, \text{U})

\textbf{INITIALIZE-HIDDEN}(n, m, M)  \newline \indent H \leftarrow \textbf{INITIALIZE}(n,m,\text{0})  \newline  \newline \indent S \leftarrow \emptyset  \newline \indent \textbf{while} \quad \lvert S \lvert < M  \newline \indent \indent S \leftarrow S \cup \textbf{RANDOM}(\textbf{LOCATIONS}(H))  \newline   \newline \indent \textbf{foreach} \quad (i, j) \in S  \newline \indent \indent H_{i, j} = \text{*}  \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(n, m, i, j) \setminus S   \newline \indent \indent \indent H_{u,v} \leftarrow H_{u,v} + 1  \newline  \newline \indent \textbf{return} \quad H

Exposing Cells
The expose behavior can be thought of as a flood fill on the grid, exposing any empty region bordered by grid locations containing mine counts and the boundaries of the grid.

A matrix, F \in \mathbb{Z}^{n \times m}, represents the topography of the board. A value of zero is reserved for sections of the board that have yet to be visited, a value of one for those that have, two for those that are boundaries and three for mines. A stack, S, keeps track of locations that should be inspected.

If a cell location can be exposed, then each of its neighbors will be added to the stack to be inspected. Those neighbors that have already been inspected will be skipped. Once all the reachable grid locations have been inspected, the process terminates.

\textbf{EXPOSE}(H, V, i, j)  \newline \indent \textbf{foreach} \quad (u,v) \in \textbf{LOCATIONS}(H)  \newline \indent \indent \textbf{if} \quad H_{u,v} = \text{0}  \newline \indent \indent \indent F_{u,v} \leftarrow 0  \newline \indent \indent \textbf{if} \quad H_{i,j} = \text{*}  \newline \indent \indent \indent F_{u,v} \leftarrow 3  \newline \indent \indent \textbf{else}  \newline \indent \indent \indent F_{u,v} \leftarrow 2  \newline  \newline \indent \textbf{PUSH}(S, (i, j))  \newline \indent \textbf{do}  \newline \indent \indent (u, v) \leftarrow \textbf{POP}(S)  \newline \indent \indent \textbf{if} \quad F_{u,v} = 0  \newline \indent \indent \indent F_{u,v} \leftarrow 1  \newline \indent \indent \indent \textbf{foreach} \quad (r,s) \in \textbf{NEIGHBORS}(H, u, v)  \newline \indent \indent \indent \indent \textbf{PUSH}(S, (r,s))  \newline \indent \indent \textbf{elseif} \quad F_{u,v} \in (1, 2)  \newline \indent \indent \indent V_{u,v} \leftarrow H_{u, v}  \newline \indent \textbf{while} \quad \textbf{COUNT}(S) > 0



Statistical tests are used to verify the random aspects of the game’s implementation. I will skip the verification of the game’s logic as it requires use of a number of different methods that are better suited for their own post.

There are two random aspects worth thinking about: the distribution of mines and the distribution of success (i.e., not clicking a mine) for random trials. In both scenarios it made since to conduct Pearson’s chi-squared test. Under this approach there are two hypotheses:

  • H_{0}: The distribution of experimental data follows the theoretical distribution
  • H_{a}: The distribution experimental data does not follow the theoretical distribution

H_{0} is accepted when the test statistic, \chi^2, is less than the critical value, \chi_{\alpha}^2. The critical value is determined by deciding on a p-value (e.g., 0.05, 0.01, 0.001), \alpha, that results in the tail area beneath the chi-squared distribution \chi_{k}^2(x) equal to \alpha. k is the degrees of freedom in the observation.

Mine distribution

The first aspect to verify was that mines were being uniformly placed on the board. For a standard 9 \times 9 board with 10 mines, the expectation is that each grid location should be assigned \frac{10}{81} N times for N trials. k = 80 for this experiment.

In the above experiment, \chi^2 = 71.78 and \chi_{0.05}^2 = 101.87. Since \chi^2 < \chi_{0.05}^2, this affirms H_{0} and that the implemented distribution of mines is indeed uniform with a statistical significance of 0.05.

Distribution of successful clicks

The second aspect to verify is that the number of random clicks before exposing a mine follows a hypergeometric distribution. The hypergeometric distribution is appropriate since we are sampling (exposing) without replacement (the grid location remains exposed after clicking). This hypothesis relies on a non-flood-fill exposure.

The distribution has four parameters. The first is the number of samples drawn (number of exposures), the second the number of successes in the sample (number of empty exposures), the third the number of successes in the population (empty grid locations) and the last the size of the population (grid locations): h(nm-M;nm-M,nm-M, nm).

The expected frequencies for the hypergeometric distribution is given by N h(nm - M; nm - M, nm - M, nm) for N trials. k = 70 in this case.

In the above experiment \chi^2 = 47.24 and \chi_{0.05}^2 = 90.53. Since \chi^2 < \chi_{0.05}^2, this affirms H_{0} and that the number of locations exposed prior to exposing a mine follows a hypergeometric distribution with a statistical significance of 0.05.

Also included in the plot is the observed distribution for a flood based exposure. As one might expect, the observed frequency of more exposures decreases more rapidly than that of the non-flood based exposure.



Much like how a human player would learn to play the game, I decided that each model would have knowledge of game’s mechanics and no prior experience with the game. An alternative class of agents would have prior experience with the game as the case would be in a human player who had studied other player’s strategies.

To evaluate the effectiveness of the models, each played against a series of randomly generated grids and their respective success rates were captured. Each game was played on a standard beginner’s 9 \times 9 grid containing between [1, 10] mines.

For those models that refer to a probability measure, \mathbb{P}, it is assumed that the measure is determined empirically and treated as an estimate of the probability of an event and not as an a priori measure.

Marginal Model


The first model to consider is the Marginal Model. It is designed to simulate the behavior of a naive player who believes that if he observes a mine at a grid location that the location should be avoid in future trials.

The model treats the visible board, V, as a matrix of discrete random variables where each grid location is interpreted as either \textit{Empty} or (a) \textit{Mine}. This model picks the grid location with the greatest empirical probability of being empty:

\underset{(i,j)}{\arg\max} \quad \mathbb{P}(X_{i,j} = \textit{Empty})

\textbf{MARGINAL-MODEL}(V)  \newline \indent p_{\textit{max}} \leftarrow \perp  \newline \indent (i, j)_{\textit{max}} \leftarrow \perp  \newline  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{U-LOCATIONS}(V)  \newline \indent \indent p \leftarrow \mathbb{P}(V_{i,j} = \textit{Empty})  \newline \indent \indent \textbf{if} \quad p > p_{\textit{max}}   \newline \indent \indent  \indent p_{\textit{max}} \leftarrow p  \newline \indent \indent \indent (i,j)_{\textit{max}} \leftarrow (i,j)  \newline  \newline \indent \textbf{return} \quad (i,j)_{\textit{max}}

Test Results

Since the mine distribution is uniform, the model should be equivalent to selecting locations at random. The expected result is that avoiding previously occupied grid locations is an ineffective strategy as the number of mines increases. This does however, provide an indication of what the success rate should look like for chance alone.

Conditional Model


One improvement over the Marginal Model is to take into account the visual clues made visible when an empty grid location is exposed. Since an empty grid location represents the number of neighboring mines, the Conditional Model can look at these clues to determine whether or not an unexposed grid location contains a mine.

This boils down to determining the probability of \mathbb{P}(\textit{Mine} \lvert \textrm{Evidence}). A simplification in calculating the probability is to assume that each piece of evidence is independent. Under this assumption the result is a Naïve Bayes Classifier:

\mathbb{P}( C = c \vert X = x ) = \dfrac{\mathbb{P}(C = c) \prod \mathbb{P}( X_{i} = x_{i} \vert C = c)}{\prod\mathbb{P}(X_{i} = x_{i})}

As in the case of the Marginal Model, the Conditional Model returns the grid location that it has determined has the greatest probability of being empty given its neighbors:

\underset{(i,j)}{\arg\max} \quad \mathbb{P}(C_{i,j} = \textit{Empty} | N(V_{i,j}) )

\textbf{CONDITIONAL-MODEL}(V, r)  \newline \indent C \leftarrow \lbrace \textit{Empty}, \textit{Mine} \rbrace  \newline \indent S \leftarrow \textbf{U-LOCATIONS}(V)  \newline  \newline \indent T \leftarrow \emptyset    \newline \indent \textbf{foreach} \quad (i,j) \in S  \newline \indent \indent N \leftarrow \textbf{NEIGHBORS}(V, i, j, r)  \newline \indent \indent p_{\textit{max}} \leftarrow \perp  \newline \indent \indent c_{\textit{max}} \leftarrow \perp  \newline \indent \indent \textbf{foreach} \quad c \in C  \newline \indent \indent \indent p \leftarrow \mathbb{P}(C = c)  \newline \indent \indent \indent \textbf{foreach} \quad (u,v) \in N  \newline \indent \indent \indent \indent p \leftarrow p * \mathbb{P}( X_{i, j} = V_{i, j} \vert C = c )  \newline \indent \indent \indent \textbf{if} \quad p > p_{\textit{max}}  \newline \indent \indent \indent \indent p_{\textit{max}} \leftarrow p  \newline \indent \indent \indent \indent c_{\textit{max}} \leftarrow c  \newline \indent \indent \textbf{if} \quad c_{\textit{max}} = \textit {Empty}  \newline \indent \indent \indent T \leftarrow T \cup (i, j)  \newline  \newline \indent \textbf{return} \quad \textbf{RANDOM}(T)
Test Results

The Naïve Bayes Classifier is regarded as being an effective approach to classifying situations for a number of different tasks. In this case, it doesn’t look like it is effective at classifying mines from non-mines. The results are only slightly better than the Marginal Model.

Graphical Model


One shortfall of the Conditional Model is that it takes a greedy approach in determining which action to take. A more sophisticated approach is to not just consider the next action, but the possible sequence of actions that will minimize the possibility of exposing a mine.

Each of the possible observable grids, S, can be thought of as a set of vertices in a graph whose corresponding set of edges represent the transition between a state, s, to the next observable state, s^{\prime}. Each transition was achieved by performing an action, a \in A, on the state. The specific action, \alpha : S \times S \to A, is chosen from a subset of permitted actions given the state. Each transition has a probability, \mathbb{P}(s^{\prime} \vert s), of taking place.

It is possible to pick a path, \pi, through this graph that minimizes the risk by assigning a reward, \rho : S \to \mathbb{R}, to each state and attempting to identify an optimal path, \pi^{*}_{s}, from the present state that yields the greatest aggregate reward,

\displaystyle \varrho(\pi) = \sum_{(s, s^{\prime}) \in \pi} \rho(s^{\prime}) \mathbb{P}(s^{\prime} \vert s)

Solving for \pi^{*}_{s} is equivalent to solving the Longest Path Problem and can be computed efficiently using a dynamic programming solution.

\displaystyle \pi_{s}^{*} \gets \underset{\pi_{s}}{\arg\max} \ \sum_{(\sigma, \sigma^{\prime}) \in \pi_{s}} \rho(\sigma^{\prime}) \mathbb{P}(\sigma^{\prime} \vert \sigma) \ \pi_{s} \in \Pi_{s}

\textbf{GRAPHICAL-MODEL}(V)  \newline \indent (a, r)_{\textit{max}} \gets (\bot, \bot)  \newline \indent T \gets \emptyset  \newline  \newline \indent \textbf{foreach} \quad U \in \textbf{SUB-GRIDS}(V)  \newline \indent \indent (a, r)_{U} \gets \textbf{OPTIMAL-ACTION}(U, \bot)  \newline \indent \indent \textbf{if} \quad r_{U} > r_{\textit{max}}  \newline \indent \indent \indent (a,r)_{\textit{max}} \gets (a,r)_{U}  \newline \indent \indent \indent T \gets \emptyset  \newline \indent \indent \textbf{if} \quad r_{U} = r_{\textit{max}}  \newline \indent \indent \indent T \gets T \cup (a,r)_{U}  \newline  \newline \indent (a, r)_{\textit{max}} \gets \textbf{RANDOM}(T)  \newline  \newline \indent \textbf{return} \quad a_{\textit{max}}  \newline  \newline  \textbf{OPTIMAL-ACTION}(V)  \newline \indent (a, r)_{\textit{max}} \gets (\bot, \bot)  \newline  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{U-LOCATIONS}(V)  \newline \indent \indent \textbf{foreach} \quad V^{\prime} \in \textbf{OUTCOMES}(V, (i,j))  \newline \indent \indent \indent (a, r)_{V^{\prime}} \gets \textbf{OPTIMAL-ACTION}(V^{\prime})  \newline \indent \indent \indent r \gets r_{V^{\prime}} + \mathbb{P}(V^{\prime} \vert V) * \textbf{REWARD}(V^{\prime})  \newline \indent \indent \indent \textbf{if} \quad r \ge r_{\textit{max}}  \newline \indent \indent \indent \indent (a, r)_{\textit{max}} \gets ((i,j), r)  \newline  \newline \indent \textbf{return} \quad (a,r)_{\textit{max}}  \newline  \newline  \textbf{REWARD}(V)  \newline \indent \textbf{if} \quad \textbf{SUCCESS}(V)  \newline \indent \indent \indent \textbf{return} \quad +100  \newline \indent \textbf{if} \quad \textbf{FAILURE}(V)  \newline \indent \indent \indent \textbf{return} \quad -100  \newline  \newline \indent \textbf{return} +1

From the optimal walk, a sequence of optimal actions is determined by mapping \alpha over the path. Taking the first action gives the optimal grid location to expose given the current visible state of the board.

This description constitutes a Markov Decision Process. As is the case for most stochastic processes, it is assumed that the process holds the Markov Property; that future states only depend upon the current states and none of the prior states. In addition to being a Markov Decision Process, this is also an example of Reinforcement Learning.

First thing to observe is that the game state space is astronomical. For a standard beginner’s grid there is at most a sesvigintillion (10^{81}) possible grids that a player can encounter. Which as an aside, is on the order of the number of atoms in the observable universe! The set of actions at each state is slightly more manageable with at most eighty-one actions.

To simplify the state space, I chose to only consider 3 \times 3 boards and when evaluating a full grid, consider the possible sub-grids and evaluate the optimal sequence of actions for each sub-grid and pick the maximum reward associated for each sub-grid that was evaluated as the action to take on the full grid.

Test Results

The Graphical Model produces results that are only a margin better than those of the Conditional Model.

Semi-deterministic Model


The last model I’m going to talk about is a semi-deterministic model. It works by using the visible grid to infer the topology of the hidden grid and from the hidden grid, the topology that the visible grid can become.

The grid can be viewed as a graph. Each grid location is a vertex and an edge is an unexposed grid location’s influence on another grid location’s neighbor mine count.

For each of the exposed grid locations on the board, v_{i,j}, it’s neighbors, N(v_{i,j}), are all mines when the number of inbound edges E_{i,j} = d(v_{i,j}), matches the visible mine count V_{i,j}.

The model produces its inferred version, F, of the influence graph E by using the determined mine locations M.

For each of the grid locations that are exposed and the inferred influence matches the visible count, then each of the neighbors about that location can be exposed provided they are not already exposed and not an inferred mine.

From this set of possibilities, a mine location is chosen. When no mine locations can be determined, then an alternative model can be used.

\textbf{SEMI-DETERMINISTIC-MODEL}(V)  \newline \indent E \leftarrow 0_{n, m}  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V)  \newline \indent \indent \textbf{if} \quad V_{i,j} \ne \textit{U}  \newline \indent \indent \indent \textbf{continue}  \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j)  \newline \indent \indent \indent E_{u,v} \leftarrow E_{u,v} + 1  \newline  \newline \indent M \leftarrow \textit{False}_{n,m}  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V)  \newline \indent \indent \textbf{if} \quad V_{i,j} \in \lbrace \textit{U}, \textit{0} \rbrace \lor V_{i,j} \neq E_{i,j}  \newline \indent \indent \indent \textbf{continue}  \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j)  \newline \indent \indent \indent \textbf{if} \quad V_{i,j} = \textit{U}  \newline \indent \indent \indent \indent M_{u,v} \leftarrow \textit{True}   \newline  \newline \indent F \leftarrow 0_{n,m}  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V)  \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j)  \newline \indent \indent \indent \textbf{if} \quad M_{u,v}  \newline \indent \indent \indent \indent F_{i,j} \leftarrow F_{i,j} + 1  \newline  \newline \indent S \leftarrow \emptyset  \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V)  \newline \indent \indent \textbf{if} \quad V_{i,j} = \textit{U} \lor F_{i,j} \ne V_{i,j}  \newline \indent \indent \indent \textbf{continue}  \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j)  \newline \indent \indent \indent \textbf{if} \quad V_{i,j} \ne \textit{U} \lor M_{u,v}  \newline \indent \indent \indent \indent \textbf{continue}  \newline \indent \indent \indent S \leftarrow S \cup (u, v)  \newline   \newline \indent \textbf{return} \quad \textbf{RANDOM}(S)

Test Results

Since the model is a more direct attempt at solving the board, its results are superior to the previously presented models. As the number of mines increases, it is more likely that it has to rely on a more probabilistic approach.


Each of the models evaluated offered incremental improvements over their predecessors. Randomly selecting locations to expose is on par with choosing a location based on previously observed mine locations. The Conditional Model and Graphical Model yield similar results since they both make decisions based on conditioned probabilities. The Semi-deterministic Model stands alone as the only one model that produced reliable results.

The success rate point improvement between the Condition and Marginal models is most notable for boards consisting of three mines and the improvement between Graphical and Semi-deterministic models for seven mines. Improvements between Random and Marginal models is negligible and between Conditional and Graphical is minor for all mine counts fewer than seven.

Given the mathematical complexity and nondeterministic nature of the machine learning approaches, (in addition the the complexity and time involved in implementing those approaches) they don’t seem justified when more deterministic and simpler approaches exist. In particular, it seems like most people have implemented their agents using heuristics and algorithms designed to solve constraint satisfaction problems. Nonetheless, this was a good refresher to some of the elementary aspects of probability, statistics and machine learning.


Classification – Naïve Bayes.” Data Mining Algorithms in R. Wikibooks. 3 Nov. 2010. Web. 30 Oct. 2011.

Windows Minesweeper.” MinesweeperWiki. 8 Sept. 2011. Web. 30 Oct. 2011.

Kaye, Richard. “Minesweeper Is NP-complete.” [pdf] Mathematical Intelligencer 22.2 (2000): 9-15. Web. 30 Oct. 2011.

Nakov, Preslav, and Zile Wei. “MINESWEEPER, #MINESWEEPER.” 14 May 2003. Web. 14 Apr. 2012.

Richard, Sutton, and Andrew G. Barto. “3.6 Markov Decision Processes.” Reinforcement Learning: An Introduction. Cambridge, Massachusetts: Bradford Book, 1998. 4 Jan. 2005. Web. 30 Oct. 2011.

Rish, Irene “An Empirical Study of the Naive Bayes Classifer.” [pdf] IJCAI-01 Workshop on Empirical Methods in AI (2001). Web. 30 Oct. 2011.

Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall/PearsonEducation., 2003. Print.

Sun, Yijun, and Jian Li. “Adaptive Learning Approach to Landmine Detection.” [pdf] IEEE Transactions of Aerospace and Electronic Systems 41.3 (2005): 1-9. 10 Jan. 2006. Web. 30 Oct. 2011.

Taylor, John R. An introduction to error analysis: the study of uncertainties in physical measurements. Sausalito, CA: University Science Books, 1997. Print.

An Experiment in Optical Character Recognition

leave a comment »


I’ve always been interested in machine learning and how it can be applied to a number of different problems. I spent some time during college learning some of the techniques used in machine learning, but since then I’ve gotten a bit rusty. So, I revisited the subject by doing some additional research and decided to try my hand at Optical Character Recognition (OCR)- the process of identifying characters within an image and producing a text output. There are a handful of traditional aspects to this process: image acquisition, segmentation, recognition and correction. Acquisition is about correcting an input image so that a segmentation process can be readily applied; segmentation identifies the characters in the image, recognition maps those visual characters to text characters, finally correction takes the text output and rectifies any errors. The following outlines my approach to segmentation and recognition.


Consider the following body of text from one of my earlier posts:

The goal is to extract each of the characters in the image. The simplest way to accomplish this is to implement an algorithm that reads the image much in the same way one might read a block of text: start at the top of the text and scan downward identifying all of the rows of text, then for each row, read all the characters from left to right, then identify words based on white space. Figuring out the word boundaries is done via a simple clustering process. Assuming we have an ordered set of rectangles produced by the first two steps, we can calculate the average distance between consecutive rectangles. Then, once this average has been produced, to then iterate over the list once more adding rectangles to words when the distance between consecutive rectangles is less than the average distance, then creating new words when the distance is exceeded.

This segmentation approach isn’t perfect as it assumes that we are dealing with evenly spaced characters of the same size from the same font. Of course, this isn’t always the case and we have things like kerning and ligatures to deal with. In this example two categories of problems arise: character combinations such as ay, ey and ly that are separable then combinations such as yw, rt and ct that are not separable end up being interpreted as a single character using this method. For the first category, I chose to divide rectangles whenever a line of characters has a single black pixel that does not have a black pixel neighboring ((x-1, y + a) | a <- [-1, 1]) it to the left. The second case isn't as clear cut as all the pixels neighbor one another, in principal one could implement a k-means clustering algorithm, but that assumes you know how many characters (k) you have in the image. I decided to allow the error to propagate through the system.


Starting out, I knew that I wanted to use an Artificial neural network (ANN), so I spent some time reading Stuart’s and Norvig’s “Artificial Intelligence: A Modern Approach“, but felt that it wasn’t as comprehensive as I wanted, so I also read MacKay’s “Information Theory, Inference and Learning Algorithms,” which was more in tune with what I had in mind. I also came across a series (1, 2, 3) of pdf files hosted at Aberdeen’s Robert Gordon University that provided a more practical view of applying neural networks to pattern recognition.

A little background on ANNs: The general idea is that an individual neuron aggregates the weighted inputs from other neurons then outputs a signal. The magnitude of the signal is determined as a function of the aggregation called the activation function. Neurons are organized into layers, typically an input layer, one or more hidden layers and finally an output layer. Values from the input layer and feed into the hidden layer, then those outputs feed into the next and so on until all the values have gone through the output layer. The process of getting the network to map an input to an output is accomplished through training, in this case, a method known as Backpropagation. Given an input and an expected output, the input is feed through the network and produces an output. The difference between the two output vectors is the error that then needs to be feed backward through the network updating each node’s input weights such that the net error of the system is reduced. The method is effectively a gradient descent algorithm. I recommend reading the aforementioned references for details on how it all works. Following is my implementation of the Backpropagation algorithm:

using System;
using System.Linq;
using Library.Mathematics;

namespace OCRProject.Recognizers.NeuralNetworks {
    public class Neuron {
        Func<double, double> activationFunction;

        public Vector InputWeights { get; set; }

        public Neuron(Func<double, double> activationFunction) {
            this.activationFunction = activationFunction;

        public double Output(Vector input) {
            return activationFunction(;

    public class NeuralNetwork {
        private Neuron[] hiddenLayer, outputLayer;


        public Vector Output(Vector input) {
            Vector hiddenLayerOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            return new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hiddenLayerOutput));

        public Vector Train(Vector input, Vector desiredOutput, double learningRate) {
            Vector hOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            Vector oOutput = new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hOutput));

            Vector oError = new Vector(
                (i) => oOutput[i] * (1 - oOutput[i]) * (desiredOutput[i] - oOutput[i])

            for (int n = 0; n < outputLayer.Length; n++) {
                outputLayer[n].InputWeights = new Vector(
                    (i) => outputLayer[n].InputWeights[i] + learningRate * oError[n] * hOutput[i]

            Vector hError = new Vector(
                (i) => hOutput[i] * (1 - hOutput[i]) * ( Vector(oError.Dimension, (j) => outputLayer[j].InputWeights[i])))

            for (int n = 0; n < hiddenLayer.Length; n++) {
                hiddenLayer[n].InputWeights = new Vector(
                    (i) => hiddenLayer[n].InputWeights[i] + learningRate * hError[n] * input[i]

            return oError;

In terms of how all of this applies to OCR, I started out with all visible characters and produced a collection of 16×16 images. Each image contained a single character centered in the image. This image was then mapped to a 256 dimensional vector with its corresponding character mapped to an 8 dimensional vector representing the expected output. The question that remains is how many hidden layer units should be used. To figure this out, I conducted a small experiment:

Hidden Units 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Minimum (%) 0 4 7 21 36 48 69 80 87 85 92 88 92 89 93 93
Average (%) 1 4 12 28 44 56 72 83 89 90 94 93 94 93 95 95
Maximum (%) 1 6 15 31 51 62 75 87 91 93 99 96 98 95 98 99

Each run consisted of 10 trials with each trial taking 5000 iterations to train the network. (Normally, one can measure the Mean squared error of the network and halt the training process once it is sufficiently small.) After doing this testing, I found that ANNs with 11 hidden units looked to give the highest accuracy with the fewest number of units. Given the original text that was used, the following text was produced:

As expected, the category of errors that were identified earlier (character combinations ff, rt, ct) were not segmented correctly and consequently not recognized correctly.


There is a lot of additional work that could be thrown at the project. In the future, it’d be good to modify the solution to accept any sized font as well as any font, adding support for images containing scanned or photographed text rather than computer generated images and some error correction on the output to flag parts of the text that require review. I suspect that I will continue down this road and start investigating how these methods can be applied to general computer vision for some planned robotics projects.