Antimatroid, The

thoughts on computer science, electronics, mathematics

Posts Tagged ‘Machine Learning

GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA

leave a comment »

C, CUDA, and Python source code available on GitHub


Gaussian Mixture Models [1, 435-439] offer a simple way to capture complex densities by employing a linear combination of K multivariate normal distributions, each with their own mean, covariance, and mixture coefficient, \pi_{k}, s.t. \sum_{k} \pi_{k} = 1.

\displaystyle p( x ) = \sum_{k = 1}^{K} \pi_{k} p(x \lvert \mu_k, \Sigma_k)

Of practical interest is the learning of the number of components and the values of the parameters. Evaluation criteria, such as Akaike and Bayesian, can be used to identify the number of components, or non-parametric models like Dirichlet processes can be used to avoid the matter all together. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU for varying numbers of components, points, and point dimensions.

Multivariate Normal Distribution

The multivariate normal distribution With mean, \mu \in \mathbb{R}^d, d \in \mathbb{N}_1, and symmetric, positive definite covariance, \Sigma \in \mathbb{R}^{d \times d}, is given by:

\displaystyle 	p( x \lvert \mu, \Sigma ) = \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma \rvert }} \exp{\left( - (x - \mu)^{T} \Sigma^{-} (x - \mu) / 2 \right)}

From a computational perspective, we will be interested in evaluating the density for N values. Thus, a naive implementation would be bounded by \mathcal{O}\left(N d^4\right) due to the matrix determinate in the normalization term. We can improve upon this by computing the Cholesky factorization, \Sigma = L L^T, where L is a lower triangular matrix [6, 157-158]. The factorization requires \mathcal{O} \left ( d^3 \right ) time and computing the determinate becomes \mathcal{O} \left (d \right) by taking advantage of the fact that \det\left(L L^T\right) = \det(L)^2 = \prod_i L_{i,i}^2. Further, we can precompute the factorization and normalization factor for a given parameterization which leaves us with complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires one perform two vector matrix operations and find the inverse of the covariance matrix with worst case behavior \mathcal{O} \left (d^3\right). Leveraging the Cholesky factorization, we’ll end up solving a series of triangular systems by forward and backward substitution in \mathcal{O} \left (d^2\right) and completing an inner product in \mathcal{O} \left (d\right) as given by L z = x - \mu, L^T z = y, and (x-\mu)^T y. Thus, our pre-initialization time is \mathcal{O} \left (d^3 \right) and density determination given by \mathcal{O} \left (N d^2 \right). Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic, \Sigma = \sigma I, and non-isotropic, \Sigma_{k,k} = \sigma_k, configurations. For robustness, we’ll stick with the full covariance.

\displaystyle \log p( x \lvert \mu, \Sigma ) = - \frac{1}{2} \left( d \log 2\pi + \log \lvert \Sigma \rvert \right ) - \frac{1}{2} (x - \mu)^{T} \Sigma^{-} (x - \mu)

To avoid numerical issues such as overflow and underflow, we’re going to consider \log p(x \lvert \mu, \Sigma) throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criteria satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a subspace of \mathbb{R}^d.

Expectation Maximization

From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by k^{*} = \arg\max_k \, \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k). Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updating those parameters by maximizing the log-likelihood of the data:

\displaystyle \mathcal{L} \left( \mathcal{D} \lvert \mu, \Sigma \right) = \sum_{n = 1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log{ \left [ \sum_{k = 1}^{K} \pi_{k} p \left( x_n \lvert \mu_k, \Sigma_k \right ) \right ] }

Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:

\displaystyle \log p( x ) = a + \log \left[ \sum_{k = 1}^{K} \exp{ \left( \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k) - a \right ) } \right ]

Where the a term is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of \mu_k and \Sigma_k, and account for it explicitly in the update of \pi_k):

\displaystyle \gamma_{k, n} = \frac{ p \left ( x_n \lvert \mu_k, \Sigma_k \right ) }{ p(x) } \qquad \Gamma_k = \sum_{n=1}^{N} \gamma_{k, n}

\displaystyle \log \gamma_{k, n} =  \log p \left ( x_n \lvert \mu_k, \Sigma_k \right )  - \log p(x) \qquad \log \Gamma_k = a + \log \left [ \sum_{n=1}^{N} \exp{ \left( \log \gamma_{k, n} - a \right )} \right ]

The new parameters are resolved within the maximization step:

\displaystyle \pi_{k}^{(t+1)} = \frac{ \pi_{k}^{(t)} \Gamma_k }{ \sum_{i=1}^{K} \pi_{i}^{(t)} \Gamma_i } \qquad \log \pi_{k}^{(t+1)} = \log \pi_{k}^{(t)} + \log \Gamma_k - a - \log \left [ \sum_{i=1}^{K} \exp{ \left( \log \pi_{i}^{(t)} + \log \Gamma_i - a \right )} \right ]

\displaystyle \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \gamma_{n, k} }{ \Gamma_k  } \qquad \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \exp{ \log \gamma_{n, k} } }{ \exp{ \log \Gamma_k }  }

\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \gamma_{n, k} }{ \Gamma_k  }

\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \exp \log \gamma_{n, k} }{ \exp \log \Gamma_k  }

The algorithm continues back and forth between expectation and maximization stages until the change in log likelihood is less than some epsilon, or a maximum number of user specified iterations has elapsed.


Sequential Per iteration complexity given by \mathcal{O}\left(2 K N d^2 + K N d + 2K + N + K d^3\right). We expect d \le K < N because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by \mathcal{O}\left(2 K N d^2 \right).

Parallel There are two natural data parallelisms that appear in the algorithm. The calculation of the \mathcal{L} and \gamma across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. Resulting complexity is given by \mathcal{O}\left(\frac{2}{P} d^2 K N \right) for work coordinated across P processors.


Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of \log p\left(x | \mu_k, \Sigma_k \right) we compute each point in parallel by forming a grid of one dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the loglikelihood and \log \gamma_{n,k} is done by computing and storing \log p(x), then updating the storage for \log p\left(x|\mu_k,\Sigma_k\right), and then performing a parallel reduction over \log p(x) to produce the loglikelihood. Parallel reductions are a core tasks are implemented by first standardizing the input array of points to an supremum power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the loglikelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute \log \Gamma_k, \log \gamma_{n,k} is copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component to yield \log \Gamma_k. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances is computed with a single kernel invocation parallel in the number of components.

The main design consideration was whether or not use streams. For larger numbers of components, this will result in improved runtime performance, however, it comes at the cost of increased memory usage which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increase in memory was favorable to using less memory and executing each component sequentially.

To optimize the runtime of the implementation nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, this allowed the implementation to execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events for parallelization of the component probability densities and parameter updates in the maximization step. In doing so, this allowed for a K fold reduction since the components calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory minimizing the number of global memory writes instead of performing iterated reductions against sequential addressing that preformed global memory reads and writes for each point. The final optimization step was to used pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.


To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use The Box-Muller transform, Zigguart method, or Ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can by done by sampling a standard normal vector \eta \sim \mathcal{N}(0 ,I_d) and computing \mu + \Sigma^{1/2} \eta where \Sigma^{1/2} can be computed by Eigendecomposition, \Sigma^{1/2} = Q \Delta^{1/2} Q^{-}, or Cholesky factorization, \Sigma = L L^T, \Sigma^{1/2} = L. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient and then sample the underlying \mathcal{N}(\mu_k, \Sigma_k) distribution, and perform this process for the desired number of points.

The matter of generating GMMs it more interesting. Here we draw \pi_i = X_i / \sum_{j} X_j for X_i \sim \mathcal{U}(0, 1), alternatively, one could draw \pi \sim \text{Dir}(\alpha). Means are drawn by \mu \sim \mathcal{N}(0, a I_d) with a > 1 so that means are relatively spread out in \mathbb{R}^{d}. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution, \Sigma \sim W(I_d, d, n) for n > d - 1, comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of n x_i \sim \mathcal{N}(0, I_d) vectors. Based on a \mathcal{O}\left(d^2\right) method by [8], [9] gives an equally efficient method for sampling \Sigma^{1/2} = L by letting L_{i,i} \sim \chi^2(n - i) and L_{i,j} \sim \mathcal{N}(0, 1) for 0 \le i < d and 0 \le j < i.

To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured by varying each of the N, K, and d parameters while holding the other two fixed. From an end user perspective wall clock time is preferable to the time the operating system actually devoted to the problem since wall clock time is more valuable. There will be variability in the results since each instance requires a different number of iterations for the log likelihood to converge. Tests were conducted on a Xeon 1245 v5 3.5 Ghz system with 32GB of memory and an NVIDIA GTX 1060 6GB graphics card with 1280 cores.


Since the parameter space is relatively large Figures 2-5 look at varying one parameter will fixing the others to demonstrate the relative merits of each approach. When the number of points dominates the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the num of components is high giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable delivering superior runtime performance without sacrificing quality.

It is important to note that the results obtained from the CUDA solution may differ to those the sequential and parallel approaches. This is due to nondeterministic round off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably, the presence of fused multiple add on NVIDIA GPUs which are more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:



This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instance, and focus on providing higher level language bindings so that it can be better utilized in traditional data science toolchains.


  1. Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
  2. Collange, S., Defour, D., Graillat, S., and Lakymhuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83-97.
  3. Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the eme algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1-38.
  4. Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, (2016) C1-C32.
  5. Harris, M. Optimizing cuda. SC07: High Performance Computing with CUDA (2007).
  6. Kincaid, D., and Cheney, W. Numerical analysis: mathematics of scientific computing. 3 ed. Brooks/Cole, 2002.
  7. Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257-260.
  8. Odell, P., and Feiveson, A. A Numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199-203.
  9. Sawyer, S. Wishart distributions and inverse-wishart sampling. URL: (2007).
  10. Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and ieee 754 compliance for nvidia gpus. rn(A + B) 21., 1 (2011), 18749-19424.

An Experiment in Optical Character Recognition

leave a comment »


I’ve always been interested in machine learning and how it can be applied to a number of different problems. I spent some time during college learning some of the techniques used in machine learning, but since then I’ve gotten a bit rusty. So, I revisited the subject by doing some additional research and decided to try my hand at Optical Character Recognition (OCR)- the process of identifying characters within an image and producing a text output. There are a handful of traditional aspects to this process: image acquisition, segmentation, recognition and correction. Acquisition is about correcting an input image so that a segmentation process can be readily applied; segmentation identifies the characters in the image, recognition maps those visual characters to text characters, finally correction takes the text output and rectifies any errors. The following outlines my approach to segmentation and recognition.


Consider the following body of text from one of my earlier posts:

The goal is to extract each of the characters in the image. The simplest way to accomplish this is to implement an algorithm that reads the image much in the same way one might read a block of text: start at the top of the text and scan downward identifying all of the rows of text, then for each row, read all the characters from left to right, then identify words based on white space. Figuring out the word boundaries is done via a simple clustering process. Assuming we have an ordered set of rectangles produced by the first two steps, we can calculate the average distance between consecutive rectangles. Then, once this average has been produced, to then iterate over the list once more adding rectangles to words when the distance between consecutive rectangles is less than the average distance, then creating new words when the distance is exceeded.

This segmentation approach isn’t perfect as it assumes that we are dealing with evenly spaced characters of the same size from the same font. Of course, this isn’t always the case and we have things like kerning and ligatures to deal with. In this example two categories of problems arise: character combinations such as ay, ey and ly that are separable then combinations such as yw, rt and ct that are not separable end up being interpreted as a single character using this method. For the first category, I chose to divide rectangles whenever a line of characters has a single black pixel that does not have a black pixel neighboring ((x-1, y + a) | a <- [-1, 1]) it to the left. The second case isn't as clear cut as all the pixels neighbor one another, in principal one could implement a k-means clustering algorithm, but that assumes you know how many characters (k) you have in the image. I decided to allow the error to propagate through the system.


Starting out, I knew that I wanted to use an Artificial neural network (ANN), so I spent some time reading Stuart’s and Norvig’s “Artificial Intelligence: A Modern Approach“, but felt that it wasn’t as comprehensive as I wanted, so I also read MacKay’s “Information Theory, Inference and Learning Algorithms,” which was more in tune with what I had in mind. I also came across a series (1, 2, 3) of pdf files hosted at Aberdeen’s Robert Gordon University that provided a more practical view of applying neural networks to pattern recognition.

A little background on ANNs: The general idea is that an individual neuron aggregates the weighted inputs from other neurons then outputs a signal. The magnitude of the signal is determined as a function of the aggregation called the activation function. Neurons are organized into layers, typically an input layer, one or more hidden layers and finally an output layer. Values from the input layer and feed into the hidden layer, then those outputs feed into the next and so on until all the values have gone through the output layer. The process of getting the network to map an input to an output is accomplished through training, in this case, a method known as Backpropagation. Given an input and an expected output, the input is feed through the network and produces an output. The difference between the two output vectors is the error that then needs to be feed backward through the network updating each node’s input weights such that the net error of the system is reduced. The method is effectively a gradient descent algorithm. I recommend reading the aforementioned references for details on how it all works. Following is my implementation of the Backpropagation algorithm:

using System;
using System.Linq;
using Library.Mathematics;

namespace OCRProject.Recognizers.NeuralNetworks {
    public class Neuron {
        Func<double, double> activationFunction;

        public Vector InputWeights { get; set; }

        public Neuron(Func<double, double> activationFunction) {
            this.activationFunction = activationFunction;

        public double Output(Vector input) {
            return activationFunction(;

    public class NeuralNetwork {
        private Neuron[] hiddenLayer, outputLayer;


        public Vector Output(Vector input) {
            Vector hiddenLayerOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            return new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hiddenLayerOutput));

        public Vector Train(Vector input, Vector desiredOutput, double learningRate) {
            Vector hOutput = new Vector(hiddenLayer.Length, (i) => hiddenLayer[i].Output(input));
            Vector oOutput = new Vector(outputLayer.Length, (i) => outputLayer[i].Output(hOutput));

            Vector oError = new Vector(
                (i) => oOutput[i] * (1 - oOutput[i]) * (desiredOutput[i] - oOutput[i])

            for (int n = 0; n < outputLayer.Length; n++) {
                outputLayer[n].InputWeights = new Vector(
                    (i) => outputLayer[n].InputWeights[i] + learningRate * oError[n] * hOutput[i]

            Vector hError = new Vector(
                (i) => hOutput[i] * (1 - hOutput[i]) * ( Vector(oError.Dimension, (j) => outputLayer[j].InputWeights[i])))

            for (int n = 0; n < hiddenLayer.Length; n++) {
                hiddenLayer[n].InputWeights = new Vector(
                    (i) => hiddenLayer[n].InputWeights[i] + learningRate * hError[n] * input[i]

            return oError;

In terms of how all of this applies to OCR, I started out with all visible characters and produced a collection of 16×16 images. Each image contained a single character centered in the image. This image was then mapped to a 256 dimensional vector with its corresponding character mapped to an 8 dimensional vector representing the expected output. The question that remains is how many hidden layer units should be used. To figure this out, I conducted a small experiment:

Hidden Units 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Minimum (%) 0 4 7 21 36 48 69 80 87 85 92 88 92 89 93 93
Average (%) 1 4 12 28 44 56 72 83 89 90 94 93 94 93 95 95
Maximum (%) 1 6 15 31 51 62 75 87 91 93 99 96 98 95 98 99

Each run consisted of 10 trials with each trial taking 5000 iterations to train the network. (Normally, one can measure the Mean squared error of the network and halt the training process once it is sufficiently small.) After doing this testing, I found that ANNs with 11 hidden units looked to give the highest accuracy with the fewest number of units. Given the original text that was used, the following text was produced:

As expected, the category of errors that were identified earlier (character combinations ff, rt, ct) were not segmented correctly and consequently not recognized correctly.


There is a lot of additional work that could be thrown at the project. In the future, it’d be good to modify the solution to accept any sized font as well as any font, adding support for images containing scanned or photographed text rather than computer generated images and some error correction on the output to flag parts of the text that require review. I suspect that I will continue down this road and start investigating how these methods can be applied to general computer vision for some planned robotics projects.