# Antimatroid, The

thoughts on computer science, electronics, mathematics

## GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA

C, CUDA, and Python source code available on GitHub

### Introduction

Gaussian Mixture Models [1, 435-439] offer a simple way to capture complex densities by employing a linear combination of $K$ multivariate normal distributions, each with their own mean, covariance, and mixture coefficient, $\pi_{k}$, s.t. $\sum_{k} \pi_{k} = 1$.

$\displaystyle p( x ) = \sum_{k = 1}^{K} \pi_{k} p(x \lvert \mu_k, \Sigma_k)$

Of practical interest is the learning of the number of components and the values of the parameters. Evaluation criteria, such as Akaike and Bayesian, can be used to identify the number of components, or non-parametric models like Dirichlet processes can be used to avoid the matter all together. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU for varying numbers of components, points, and point dimensions.

### Multivariate Normal Distribution

The multivariate normal distribution With mean, $\mu \in \mathbb{R}^d, d \in \mathbb{N}_1$, and symmetric, positive definite covariance, $\Sigma \in \mathbb{R}^{d \times d}$, is given by:

$\displaystyle p( x \lvert \mu, \Sigma ) = \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma \rvert }} \exp{\left( - (x - \mu)^{T} \Sigma^{-} (x - \mu) / 2 \right)}$

From a computational perspective, we will be interested in evaluating the density for $N$ values. Thus, a naive implementation would be bounded by $\mathcal{O}\left(N d^4\right)$ due to the matrix determinate in the normalization term. We can improve upon this by computing the Cholesky factorization, $\Sigma = L L^T$, where $L$ is a lower triangular matrix [6, 157-158]. The factorization requires $\mathcal{O} \left ( d^3 \right )$ time and computing the determinate becomes $\mathcal{O} \left (d \right)$ by taking advantage of the fact that $\det\left(L L^T\right) = \det(L)^2 = \prod_i L_{i,i}^2$. Further, we can precompute the factorization and normalization factor for a given parameterization which leaves us with complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires one perform two vector matrix operations and find the inverse of the covariance matrix with worst case behavior $\mathcal{O} \left (d^3\right)$. Leveraging the Cholesky factorization, we’ll end up solving a series of triangular systems by forward and backward substitution in $\mathcal{O} \left (d^2\right)$ and completing an inner product in $\mathcal{O} \left (d\right)$ as given by $L z = x - \mu$, $L^T z = y$, and $(x-\mu)^T y$. Thus, our pre-initialization time is $\mathcal{O} \left (d^3 \right)$ and density determination given by $\mathcal{O} \left (N d^2 \right)$. Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic, $\Sigma = \sigma I$, and non-isotropic, $\Sigma_{k,k} = \sigma_k$, configurations. For robustness, we’ll stick with the full covariance.

$\displaystyle \log p( x \lvert \mu, \Sigma ) = - \frac{1}{2} \left( d \log 2\pi + \log \lvert \Sigma \rvert \right ) - \frac{1}{2} (x - \mu)^{T} \Sigma^{-} (x - \mu)$

To avoid numerical issues such as overflow and underflow, we’re going to consider $\log p(x \lvert \mu, \Sigma)$ throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criteria satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a subspace of $\mathbb{R}^d$.

### Expectation Maximization

From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by $k^{*} = \arg\max_k \, \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k)$. Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updating those parameters by maximizing the log-likelihood of the data:

$\displaystyle \mathcal{L} \left( \mathcal{D} \lvert \mu, \Sigma \right) = \sum_{n = 1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log{ \left [ \sum_{k = 1}^{K} \pi_{k} p \left( x_n \lvert \mu_k, \Sigma_k \right ) \right ] }$

Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:

$\displaystyle \log p( x ) = a + \log \left[ \sum_{k = 1}^{K} \exp{ \left( \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k) - a \right ) } \right ]$

Where the $a$ term is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of $\mu_k$ and $\Sigma_k$, and account for it explicitly in the update of $\pi_k$):

$\displaystyle \gamma_{k, n} = \frac{ p \left ( x_n \lvert \mu_k, \Sigma_k \right ) }{ p(x) } \qquad \Gamma_k = \sum_{n=1}^{N} \gamma_{k, n}$

$\displaystyle \log \gamma_{k, n} = \log p \left ( x_n \lvert \mu_k, \Sigma_k \right ) - \log p(x) \qquad \log \Gamma_k = a + \log \left [ \sum_{n=1}^{N} \exp{ \left( \log \gamma_{k, n} - a \right )} \right ]$

The new parameters are resolved within the maximization step:

$\displaystyle \pi_{k}^{(t+1)} = \frac{ \pi_{k}^{(t)} \Gamma_k }{ \sum_{i=1}^{K} \pi_{i}^{(t)} \Gamma_i } \qquad \log \pi_{k}^{(t+1)} = \log \pi_{k}^{(t)} + \log \Gamma_k - a - \log \left [ \sum_{i=1}^{K} \exp{ \left( \log \pi_{i}^{(t)} + \log \Gamma_i - a \right )} \right ]$

$\displaystyle \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \gamma_{n, k} }{ \Gamma_k } \qquad \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \exp{ \log \gamma_{n, k} } }{ \exp{ \log \Gamma_k } }$

$\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \gamma_{n, k} }{ \Gamma_k }$

$\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \exp \log \gamma_{n, k} }{ \exp \log \Gamma_k }$

The algorithm continues back and forth between expectation and maximization stages until the change in log likelihood is less than some epsilon, or a maximum number of user specified iterations has elapsed.

### Implementations

Sequential Per iteration complexity given by $\mathcal{O}\left(2 K N d^2 + K N d + 2K + N + K d^3\right)$. We expect $d \le K < N$ because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by $\mathcal{O}\left(2 K N d^2 \right)$.

Parallel There are two natural data parallelisms that appear in the algorithm. The calculation of the $\mathcal{L}$ and $\gamma$ across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. Resulting complexity is given by $\mathcal{O}\left(\frac{2}{P} d^2 K N \right)$ for work coordinated across $P$ processors.

Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of $\log p\left(x | \mu_k, \Sigma_k \right)$ we compute each point in parallel by forming a grid of one dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the loglikelihood and $\log \gamma_{n,k}$ is done by computing and storing $\log p(x)$, then updating the storage for $\log p\left(x|\mu_k,\Sigma_k\right)$, and then performing a parallel reduction over $\log p(x)$ to produce the loglikelihood. Parallel reductions are a core tasks are implemented by first standardizing the input array of points to an supremum power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the loglikelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute $\log \Gamma_k$, $\log \gamma_{n,k}$ is copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component to yield $\log \Gamma_k$. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances is computed with a single kernel invocation parallel in the number of components.

The main design consideration was whether or not use streams. For larger numbers of components, this will result in improved runtime performance, however, it comes at the cost of increased memory usage which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increase in memory was favorable to using less memory and executing each component sequentially.

To optimize the runtime of the implementation nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, this allowed the implementation to execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events for parallelization of the component probability densities and parameter updates in the maximization step. In doing so, this allowed for a $K$ fold reduction since the components calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory minimizing the number of global memory writes instead of performing iterated reductions against sequential addressing that preformed global memory reads and writes for each point. The final optimization step was to used pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.

### Evaluation

To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use The Box-Muller transform, Zigguart method, or Ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can by done by sampling a standard normal vector $\eta \sim \mathcal{N}(0 ,I_d)$ and computing $\mu + \Sigma^{1/2} \eta$ where $\Sigma^{1/2}$ can be computed by Eigendecomposition, $\Sigma^{1/2} = Q \Delta^{1/2} Q^{-}$, or Cholesky factorization, $\Sigma = L L^T, \Sigma^{1/2} = L$. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient and then sample the underlying $\mathcal{N}(\mu_k, \Sigma_k)$ distribution, and perform this process for the desired number of points.

The matter of generating GMMs it more interesting. Here we draw $\pi_i = X_i / \sum_{j} X_j$ for $X_i \sim \mathcal{U}(0, 1)$, alternatively, one could draw $\pi \sim \text{Dir}(\alpha)$. Means are drawn by $\mu \sim \mathcal{N}(0, a I_d)$ with $a > 1$ so that means are relatively spread out in $\mathbb{R}^{d}$. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution, $\Sigma \sim W(I_d, d, n)$ for $n > d - 1$, comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of $n$ $x_i \sim \mathcal{N}(0, I_d)$ vectors. Based on a $\mathcal{O}\left(d^2\right)$ method by [8], [9] gives an equally efficient method for sampling $\Sigma^{1/2} = L$ by letting $L_{i,i} \sim \chi^2(n - i)$ and $L_{i,j} \sim \mathcal{N}(0, 1)$ for $0 \le i < d$ and $0 \le j < i$.

To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured by varying each of the $N$, $K$, and $d$ parameters while holding the other two fixed. From an end user perspective wall clock time is preferable to the time the operating system actually devoted to the problem since wall clock time is more valuable. There will be variability in the results since each instance requires a different number of iterations for the log likelihood to converge. Tests were conducted on a Xeon 1245 v5 3.5 Ghz system with 32GB of memory and an NVIDIA GTX 1060 6GB graphics card with 1280 cores.

Since the parameter space is relatively large Figures 2-5 look at varying one parameter will fixing the others to demonstrate the relative merits of each approach. When the number of points dominates the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the num of components is high giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable delivering superior runtime performance without sacrificing quality.

It is important to note that the results obtained from the CUDA solution may differ to those the sequential and parallel approaches. This is due to nondeterministic round off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably, the presence of fused multiple add on NVIDIA GPUs which are more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:

### Conclusion

This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instance, and focus on providing higher level language bindings so that it can be better utilized in traditional data science toolchains.

### References

1. Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
2. Collange, S., Defour, D., Graillat, S., and Lakymhuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83-97.
3. Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the eme algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1-38.
4. Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, (2016) C1-C32.
5. Harris, M. Optimizing cuda. SC07: High Performance Computing with CUDA (2007).
6. Kincaid, D., and Cheney, W. Numerical analysis: mathematics of scientific computing. 3 ed. Brooks/Cole, 2002.
7. Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257-260.
8. Odell, P., and Feiveson, A. A Numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199-203.
9. Sawyer, S. Wishart distributions and inverse-wishart sampling. URL: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf (2007).
10. Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and ieee 754 compliance for nvidia gpus. rn(A + B) 21., 1 (2011), 18749-19424.

Written by lewellen

2017-04-22 at 11:36 am

## A Greedy Approximation Algorithm for the Linear Assignment Problem

Starting today, I will be posting some of the related source code for articles on GitHub.

### Introduction

The Linear Assignment Problem (LAP) is concerned with uniquely matching an equal number of workers to tasks, $n$, such that the overall cost of the pairings is minimized. A polynomial time algorithm was developed in the late fifties by [6], and further refined by [9], called the Hungarian method. Named so after the work of Hungarian mathematicians KÃ¶nig and EgervÃ¡ry whose theorems in the 1930s form the basis for the method. While the Hungarian Method can solve LAP instances in $\mathcal{O}\left(n^3\right)$ time, we wish to find faster algorithms even if it means sacrificing optimality in the process. Here we examine a greedy $\alpha$-approximation algorithm with $\mathcal{O}\left(n^2 \log n \right)$ runtime in terms of its approximation factor and compare it empirically to the Hungarian method.

### Linear Assignment Problem

\displaystyle \begin{aligned} C_n = \min & \sum_{i=1}^{n} \sum_{j=1}^{n} M_{i,j} x_{i,j} \\ s.t. & \sum_{i=1}^{n} x_{i,j} = 1, \quad j = 1, \ldots, n \\ & \sum_{j=1}^{n} x_{i,j} = 1, \quad i = 1, \dots, n \label{eqn:lap} \end{aligned}

The above linear program has cost, $M \in \mathbb{Z}_{+}^{n \times n}$, and assignment, $x \in \lbrace 0,1 \rbrace^{n \times n}$, matrices that specify the terms of the LAP. This is equivalent to finding a perfect matching in a weighted bipartite graph. A minimal cost may have several possible assignments, but we are only interested in finding just one. It is assumed that no one worker can do all jobs more efficiently by themselves than the distributing work across all workers. Likewise, if the costs are thought of as durations, then the minimum cost is the minimum sequential rather than parallel time taken to complete the tasks.

From a practical point of view, we may relax the integral constraint on $M$ and allow all positive real-valued costs. For instances where there are more jobs than workers, and vice versa, dummy entries valued greater than the existing maximum may be added. Minimizing the cost is the default objective, but the maximum cost can be found by finding the optimal assignment for $M^{\prime}_{i,j} = M_{max} - M_{i,j}$, then finding the cost relative to $M$.

### Algorithms

Brute Force Rather than using the mathematical programming or graph theoretic representation of the problem, we can instead view the problem as finding the assignment that minimizes the cost out of all possible assignments:

$\displaystyle \pi^{*} = \underset{\pi \in \Pi_n}{\arg\min} \sum_{i=1}^{n} M_{i, \pi_i}$

There are $n!$ such assignments which can be produced using an iterative version of Heap’s algorithm [5] in $\mathcal{O}\left(n!\right)$ time assuming one does differential scoring (opposed to calculating the score for each permutation which would result in an $\mathcal{O}\left(n^2 (n-1)!\right)$ algorithm.)

Random The random algorithm selects a permutation $\pi \in \Pi_n$ uniformly from the set of all possible assignment permutations in $\mathcal{O}\left(n\right)$ time using the Fisher-Yates shuffle [4]. This obviously does not produce an optimal or near-optimal solution, but serves as a straw man to compare other results.

Greedy The greedy heuristic continues to cover the row and column of the smallest uncovered entry in the cost matrix until all entries are covered. The resulting set of entries then constitutes the assignment of workers to jobs. An inefficient $\mathcal{O}\left(n^3\right)$ algorithm can be used to find the smallest entry every iteration, or a more efficient result of $\mathcal{O}\left(n^2 \log n\right)$ can be obtained through the use of a sorted, array indexed hybrid mesh and queue. Let $\texttt{QNode}$ represent a tuple consisting of row, column, and value; the previous entry in the matrix $\le$ this value, and the next entry in this matrix $\ge$ this value; and the $\texttt{QNode}s$ (left, above, right, below) that are adjacent to this node.

Algorithm 1 A greedy algorithm for the LAP.

• $\textbf{procedure } \textsc{Greedy}(M)$
• $A[i] \gets \bot \text{ for } i = 0 \ldots n - 1$
• $Q[i] \gets \texttt{QNode} \text{ for } i = 0 \ldots n^2 - 1$
• $\textsc{LinkMesh}(Q)$ // Adjacent node left, above, right, below properties
• $\textsc{Sort}(Q)$ // Sort in ascending order by node value
• $\textsc{LinkQueue}(Q)$ // Adjacent node previous and next properties
• $\qquad Q_{min} \gets Q[0]$
• $\textbf{while } Q_{min} \neq nil \textbf{ do}$
• $A[ Q_{min} \rightarrow row ] \gets Q_{min} \rightarrow col$
• $Q_{min} \gets \textsc{DeleteNode}(Q, Q_{min})$ // Deletes row and col of $Q_{min}$
• $\textbf{end while}$
• $\qquad \textbf{return } A$

Allocating and linking for assignment is $\mathcal{O}\left(n\right)$; mesh $\mathcal{O}\left(n^2\right)$; queue $\mathcal{O}\left(2n^2\log n + n^2\right)$. Therefore, initialization requires $\mathcal{O}\left(n^2 \log n\right)$ time. The body of the loop requires a constant time assignment of worker to job, and $\mathcal{O}\left(2k - 1\right)$ time to remove the row and column from a $k \times k$ matrix using a modified depth first search. Thus, the loop itself accounts for $\mathcal{O}\left(n^2\right)$ time. The resulting time complexity is therefore $\mathcal{O}\left(n^2 \log n\right) \square$.

$\displaystyle \begin{pmatrix} 62 & 31 & 79 & \fbox{6} & 21 & 37 \\ 45 & 27 & 23 & 66 & \fbox{9} & 17 \\ 83 & 59 & 25 & 38 & 63 & \fbox{25} \\ \fbox{1} & 37 & 53 & 100 & 80 & 51 \\ 69 & \fbox{72} & 74 & 32 & 82 & 31 \\ 34 & 95 & \fbox{61} & 64 & 100 & 82 \\ \end{pmatrix} \quad \begin{pmatrix} 62 & 31 & 79 & \fbox{6} & 21 & 37 \\ 45 & 27 & 23 & 66 & \fbox{9} & 17 \\ 83 & 59 & \fbox{25} & 38 & 63 & 25 \\ \fbox{1} & 37 & 53 & 100 & 80 & 51 \\ 69 & 72 & 74 & 32 & 82 & \fbox{31} \\ 34 & \fbox{95} & 61 & 64 & 100 & 82 \\ \end{pmatrix}$

Breaking ties for the minimum uncovered value can result in different costs. This drawback is shown in the above example were choosing $25$ at $(3,6)$ yields a minimum cost of $174$, where as the one at $(3, 3)$ gives a minimum cost of $167$. The next progression in the design of the greedy algorithm would be to try all minimum positions and keep the top $k$ performing paths.

Hungarian The general idea behind the Kuhn-Munkres algorithm is that if we are given an initial assignment, we can make further assignments and potentially reassign workers until all workers have been tasked with a job. The high-level sketch of the algorithm starts with an initial assignment. While we have jobs that are unassigned, we look for qualified workers, ie, the zero entries. If a worker is already assigned to a job, but is also qualified for another, then we prime the alternative and continue to the next qualified worker, but if that is the only job the worker is qualified for, then we’d like to reassign any other worker already tasked to that job. This leads to a natural ripple effect represented by an alternating path of starred and primed entries. In Munkres’ paper [9] “starred” zero’s represent assignments of workers to jobs, and “primed” zero’s are alternative assignments. By flipping the bits of the path, we reassign workers to their alternative tasks while ensuring the assignment continues to be minimal by construction. After assigning as many workers as we have to, we then deduct the lowest cost to create a new qualified worker. Thus, every iteration we are guaranteed to make positive progress towards our goal of finding an optimal assignment. This scheme results in the worst case $\mathcal{O}\left(n^3\right)$ time to complete.

Algorithm 2 The Hungarian method for the LAP.

• $\textbf{procedure } \textsc{HungarianMethod}(M)$
• $M_{i,j} \gets M_{i,j} - \min_j M_{i,j} \text{ for } i = 0 \ldots n - 1$
• $M_{i,j} \gets M_{i,j} - \min_i M_{i,j} \text{ for } j = 0 \ldots n - 1$
• Star the first uncovered zero in row $i$, cover the corresponding column $j$ for $i = 0 \ldots n - 1$
• $\textbf{while }$ All columns not covered
• $\textbf{while }$ Uncovered zeros
• Prime the current uncovered zero
• $\textbf{if }$ There’s a starred zero in this row
• Uncover the starred zero’s column and cover the row
• $\textbf{else }$
• Find an alternating augmented path from the primed zero
• Unstar the starred zeros on the path and star the primed zeros on the path
• Remove all the prime markings and cover all stared zeros
• $\textbf{break}$
• $\textbf{end if}$
• $\textbf{end while}$
• $\textbf{if }$ Found path
• $\textbf{continue}$
• $\textbf{end if}$
• $M^* = \min M_{i,j}$ over all uncovered $i, j$
• $M_{i,j} = M_{i,j} - M^*$ for all uncovered columns $j$
• $M_{i,j} = M_{i,j} + M^*$ for all covered rows $i$
• $\textbf{end while }$
• $\textbf{return}$ Starred zeros // These are all the assignments
• $\textbf{end procedure}$

To further illustrate the algorithm, consider the following example where starred entries are denoted by red, and primed entries by green:

### Analysis

The prevailing convention in the literature is to look at the approximation factor, $\alpha$, to determine how close the results of an approximation algorithm are to optimal [10]. Here this ratio is the expected minimum cost assignment of the algorithm under test to the same quantity given by the expected minimum assignment cost. Let $M_{i,j} \sim \text{Exp}(1)$ be an $n \times n$ a standard exponential random cost matrix. We resort to the exponential distribution for its ease of analyis and prominence in related literature. Cf. the works of [7], [8] for analysis based on $M_{i,j} \sim \mathcal{U}(0,1)$.

Exponential Distribution Properties Let $X \sim \text{Exp}(\lambda)$ have cumulative distribution function $F_X(x) = 1 - \exp{\left(-\lambda x\right)}$ and expectation $\mathbb{E}(X) = \lambda^{-}$. The distribution demonstrates the memoryless property for expectations $\mathbb{E}(X \lvert X > a) = \mathbb{E}(X) + a$. Define the order statistic $X_{1:n} = \min \lbrace X_{1}, \ldots, X_{n} \rbrace$ to be the minimum of $n$ draws from $\text{Exp}(\lambda)$. $X_{1:n} \sim \text{Exp}(n \lambda)$ [2] with expectation $\mathbb{E}(X_{1:n}) = \left(n \lambda\right)^-$. If $Y_n = \sum_{i = 1}^{n} X_i$ then $Y_n \sim \text{Gamma}(n, \lambda)$ with expectation $\mathbb{E}(Y_n) = n \lambda^{-}$.

Expected Minimum Cost The expected minimum assignment cost for $M$ is given by [1]:

$\displaystyle \mathbb{E}(C_n) = \sum_{k = 1}^{n} \frac{1}{k^2} = H_{n}^{(2)}$

Which is the generalized harmonic number of order two and converges to $\zeta(2) = \pi^2/6$. For the generalized harmonic numbers, $H_{n}^{(k)}$, $\lim_{k\to\infty} H_{n}^{(k)} = \zeta(k)$ for $k > 1$.

Greedy The minimum value of an $n \times n$ matrix is given by the order statistic $M_{1:n^2}$ with expectation $\mathbb{E}(M_{1:n^2}) = n^{-2}$. The expected value of the minimum cost assignment is not just $\sum_{i=0}^{n-1} (n-i)^{-2}$ because the expectation doesn’t take into account the previous iteration’s minimum value. To accomplish this we make use of the memoryless property of the exponential distribution to observe that the expected difference in minimums between iterations is the expected minimum value given by $M_{i:k^2}$. If we add up all these differences we get the expected minimum value of the k’th iteration; summing all these expectations then yields the expected minimum cost assignment:

$\displaystyle \mathbb{E}(C_n) = \sum_{i=0}^{n-1} \sum_{j=0}^{i} \frac{1}{(n - j)^2} = \sum_{j=0}^{n-1} \frac{(n-j)}{(n-j)^2} = \sum_{j=0}^{n-1} \frac{1}{n-j} = H_n$

This is the harmonic number of order one which does not converge. The resulting approximation factor is:

$\displaystyle \alpha_n = \frac{H_n}{H_n^{(2)}}$

Random The random algorithm will simply select an assignment permutation, so we are just adding up $n$ $\text{Exp}(1)$ distributed random variables leading to an expected cost of:

$\displaystyle \mathbb{E}(C_n) = \sum_{i=1}^n \mathbb{E}(M_{i, \pi_i}) = n$

And approximation factor:

$\displaystyle \alpha_n = \frac{n}{H_n^{(2)}}$

From this analysis one concludes that the greedy algorithm has an unbounded approximation factor that grows significantly slower than that of randomly selecting assignments.

### Evaluation

To illustrate the preceding results, Figure 1 shows the approximation factor for the greedy algorithm implementations against the derived approximation factor. The simulated results are based on 120 $n \times n$ standard exponentially distributed matrices for $1 \le n \le 1000$. Using the same conventions for the approximation factor, Figure 2 illustrates the runtime characteristics of the algorithms after rejecting outliers due to system fluctuations. Results obtained from source code compiled with -O3 flags and ran on a Xeon E3-1245 v5 3.5 Ghz system with 32 GBs of 2133Mhz DDR4 RAM. The algorithms coincide with the theoretical time complexities as shown in Table 2.

Solver MSE
GREEDY-EFFICIENT 0.002139
GREEDY-NAIVE 0.014161
HUNGARIAN 0.232998
Table 1: Mean square error of fitted model to mean runtime for each solver. Models given by the corresponding time complexity. Fit by Levenberg-Marquardt.

### Summary

Brute Random Greedy Hungarian
Complexity $\mathcal{O}\left(n!\right)$ $\mathcal{O}\left(n\right)$ $\mathcal{O}\left(n^2 \log n\right)$ $\mathcal{O}\left(n^3\right)$
$\alpha_n$ 1 $n / H_n^{(2)}$ $H_n / H_n^{(2)}$ 1
Table 2: Merits of each approach.

Exact solutions can be delivered by the brute method when a handful of workers are being considered, and the Hungarian method should be considered for all other instances. Approximate solutions can be provided by the greedy algorithm with logarithmic degeneracy while providing a linear factor improvement over the Hungarian method. For inputs greater than those considered, the parallel Auction algorithm [3] is a suitable alternative and the subject of future work.

### References

1. Aldous, D. J. The $\zeta(2)$ limit in the random assignment problem. Random Structures & Algorithms 18, 4 (2001), 381-418.
2. Balakrishnan, N., and Rao, C. Handbook of statistics 16: Order statistics-theory and methods, 2000.
3. Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of operation research 4, 1 (1988), 105-123.
4. Durtenfeld, R. Algorithm 235: random permutation. Communications of the ACM 7, 7 (1964), 420.
5. Heap, B. Permutations by interchanges. The Computer Journal 6, 3 (1963), 293-298.
6. Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly 2, 1-2 (1955), 83097.
7. Kurtzberg, J. M. On approximation methods for the assignment problem. Journal of the ACM (JACM) 9, 4 (1962), 419-439.
8. Steele, M. J. Probability and statistics in the service of computer science: illustrations using the assignment problem. Communications in Statistics-Theory and Methods 19, 11 (1990), 4315-4329.
9. Munkres, J. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 32-38.
10. Williamson, D. P., and Shmoys, D. B. The design of approximation algorithms. Cambridge university press, 2011.

Written by lewellen

2017-03-21 at 11:12 am

## End of the Grad School Era

with one comment

If you find a path with no obstacles, it probably doesn’t lead anywhere. ~ Frank A. Clark

After a year-and-a-half of intense work, I’m happy to announce that I’ve completed my Master of Science (M.S.) in Computer Science. It has been a unique experience full of challenges and opportunities that I did not fully anticipate when I quit my job as a Senior Software Developer last year. I met a diverse group of people at different points in their lives who all shared a common goal- to seek out new knowledge and improve themselves; I among them. But it also meant studying something I’m passionate about, and I’m looking forward to putting into practice my new found machine learning knowledge.

When I announced going back to school I talked about making a jump; this idea that there was something better waiting for me out there. After being in the air for the past year-and-a-half, I’ve landed on the west coast where I’ll be writing software related to artificial intelligence research. It’s a departure from the life I’ve built here in Colorado, but it’s also an opportunity to continue advancing my knowledge and cultivating expertise in a unique area; it is a jump to bigger and better things that I hope will be a rewarding experience.

It’s tough to say what opportunities I’ll get to work on side projects in the coming year, but I do hope to work on some miscellaneous projects related to clustering GPS trajectories; put some time into my fitness tracking application Viderefit; and put more time into GPU-based speech recognition system I’ve been steadily working on. As always, check out the archive for old posts, and subscribe by e-mail (see side column), RSS, or Twitter (@antimatroidthe) for the latest happenings.

Written by lewellen

2016-01-19 at 8:00 am

Posted in Announcements

Tagged with

## Distributed k-Means Clustering

Abstract
k-Means Clustering [10] is a fundamental algorithm in machine learning, and often the first approach a user will try when they want to discover the natural groupings in a collection of n-dimensional vectors. The algorithm iteratively picks cluster centers by assigning vectors to their closest cluster, then recalculates the centers based on the assigned vectors’ means. Here it is used as a toy algorithm to motivate the pilot study, design, implementation, and evaluation of a fault-tolerant distributed system with emphasis on efficient data transfer using remote direct memory access (RDMA), and a distributed synchronization barrier that accommodates transient workers. Experiments will evaluate the performance of these interests and quantify how well the system compares to Apache Spark before concluding with a discussion on transforming this pilot study into a high-performance, cross-platform, fault-tolerant distributed system for machine learning.

### Introduction

The applications of k-Means clustering are numerous in business, engineering, and science where demands for prompt insights and support for increasingly large volumes of data motivate the need for a distributed system that exploits the inherent parallelism of the algorithm on today’s emerging hardware accelerators (e.g., FPGA [3], GPU [15], many integrated core, multi-core [9]).

There are however a number questions that arise when building such a system: what accelerators should be supported, how to partition data in an unbiased way, how to distribute those partitions, what each participant will calculate from their partition, how those individual results will be aggregated, and finally, how to synchronize tasks between participants to complete the job as a whole.

Two of these questions will be the focus of this work: how to efficiently transfer data to workers, and how to synchronize them in the presence of transient workers. For the former, remote direct memory access (RDMA) is used to distribute disk-based data from a coordinator to workers, and an extension to synchronization barriers is developed to allow workers to leave ongoing calculations, and return without interrupting other workers.

How well the system solves these issues will be measured by observed transfer rates and per iteration synchronization times. To understand the system’s scalability and place in the broader distributed machine learning landscape, its runtime performance will be evaluated against Apache Spark. Based on these results, future work will be discussed on how to move forward with the study to create a system that can meet the ever growing demands of users.

### Background

$1: \textbf{procedure } \textsc{k-Means}(C, X, d, N, K) \newline 2: \qquad \textbf{while } \text{not convered } \textbf{do } \newline 3: \qquad \qquad S_k \gets \{ x : k = \text{argmin}_{i} \lVert c_i - x \rVert_2, x \in X \} \newline 4: \qquad \qquad \Sigma_{k} \gets \sum_{x \in S_k} x \newline 5: \qquad \qquad \kappa_{k} \gets \lVert S_k \rVert \newline 6: \qquad \qquad c_{k} = \Sigma_{k} / \kappa_{k} \newline 7: \qquad \textbf{end while} \newline 8: \qquad \textbf{return } C \newline 9: \textbf{end procedure}$

k-Means clustering belongs to the family of expectation-maximization algorithms where starting from an initial guess $C$, $K$ centroids are iteratively inferred from a dataset $X$ containing $N$ $\mathbb{R}^d$ vectors. Each iteration the partial sum $\Sigma_k$, and quantity $\kappa_k$ of vectors nearest the $k^{th}$ centroid are aggregated. From these quantities the updated centroids $c_k$ can be calculated for the next iteration until the difference in magnitude from the former is less than some tolerance $\epsilon$ or a maximum number of iterations $I$ is observed providing a $\mathcal{O}(dIKN)$ linear time algorithm.

Parallelism is based on each participant $p$ being assigned a partition of the data so that instead of computing $\Sigma_k$ and $\kappa_k$ on the entirety of $X$, it is on some disjoint subset $X_p$ s.t. $X = \cup_p X_p$ and $\cap_p X_p = \emptyset$. When each participant has computed their $(\Sigma_k^p, \kappa_k^p)$, pairs the resulting set of values can be aggregated $(\Sigma_k, \kappa_k) = \sum_{p} (\Sigma_k^p, \kappa_k^p)$ to yield the updated centroid values $c_k$ for the next iteration.

To ensure that participants can formulate a unified view of the data, it is assumed that user supplied datasets are adequately shuffled. If the natural clusters of the data were organized in a biased way, participants would individually draw different conclusions about the centroids leading to divergence. Further, it will be assumed that a loss of an unbiased partition will not degrade the quality of results beyond an acceptable threshold under certain conditions (cf. [6] for a rich formalism based on coresets to support this assumption).

#### RDMA

One of the failings of conventional network programming is the need to copy memory between user and kernel space buffers to transmit data. This is an unnecessary tax that software solutions shouldn’t have to pay with runtime. The high performance computing community addressed this issue with remote direct memory access [12] whereby specialized network hardware directly reads from, and writes to pinned user space memory on the systems. These zero-copy transfers free the CPU to focus on core domain calculations and reduces the latency of exchanging information between systems over 56 Gbps InfiniBand or 40 Gbps RoCE.

#### Barriers

Synchronization barriers allow a number of participants to rendezvous at an execution point before proceeding as a group (cf. [12] for a survey of advanced shared-memory techniques). Multi-core kernels in the system use a counting (centralized) barrier: the first $n-1$ threads that approach the barrier will enter a critical section, decrement a counter, and sleep on a condition variable; the last thread to enter the critical section will reset the counter and signal the condition variable to wake the other threads before leaving the critical section. At this point all threads are synchronized and can proceed as a group.

Since the emphasis of this work is on fault tolerance, distributed all-to-all and one-to-all message passing versions of the counting barriers are considered (cf. [4] for more advanced distributed methods). In the former, every participant will notify and wait to hear from every other participant before proceeding. In the latter, every participant will notify and wait to hear from a coordinator before proceeding. The all-to-all design most closely resembles a counting barrier on each participant, whereas the all-to-one resembles a counting barrier on a single participant.

### Design

System participants consist of a single coordinator and multiple workers. A coordinator has the same responsibilities as a worker, but is also responsible for initiating a task and coordinating recovery. Workers are responsible for processing a task and exchanging notifications. A task is a collection of algorithm parameters (maximum iteration, convergence tolerance, number of clusters $K$), initial guesses ($K$ by $d$-dimensions), and unbiased partition of the dataset ($N$ by $d$-dimensions) that a participant is responsible for computing. Notifications consist of identifying information (host name, port), the current iteration, and partial results ($K$ counts and $K$ by $d$-dimensional sums).

#### Dataset Transfer

Figure 1: Example of transferring data between hosts using Accelio RDMA.

The first responsibility of the coordinator is to schedule portions of the disk based dataset to each worker. The coordinator will consult a list of participants and verify that the system can load the contents of the dataset into its collective memory. If so, the coordinator schedules work uniformly on each machine in the form of a task. A more sophisticated technique could be used, but a uniform loading will minimize how long each worker will wait on the others each iteration (assuming worker’s computing abilities are indistinguishable). The coordinator will sequentially distribute these tasks using RDMA, whereby the serialized contents of the task are read from disk into the coordinator’s user space memory are directly transfered over to an equally sized buffer in the worker’s user space memory before continuing on to the next worker.

Open source BSD licensed Accelio library [11] is used to coordinate the transfer of user space memory between machines as shown in Fig. (1). Following the nomenclature of the library, a client will allocate and register a block of user space memory; registration ensures that the operating system will not page out the memory while the network interface card is reading its contents. Next, the client will send a request asking the server to allocate an equally sized buffer. When the server receives this request, it will attempt to allocate and register the requested buffer. Once done, it will issue the rkey of the memory back to the client. An rkey is a unique identifier representing the location of the memory on the remote machine. Upon receipt of this information, the client will then ask the library to issue a RDMA write given the client’s and server’s rkeys. Since RDMA will bypass the server’s CPU all together, the server will not know when the operation completes; however, when the RDMA write completes, the client is notified and it can then notify the server that the operation is complete by terminating the connection and session.

During development it was discovered that the amount of memory that can be transfered this way is limited to just 64 MiB before the library errs out (xio_connection send queue overflow). To work around this limitation for larger partitions, the client will send chunks of memory in 64 MiB blocks. The same procedure detailed above is followed, however it is done for each block of memory with rkeys being exchanged for the appropriate offset on each client RDMA write complete notification until the entire contents of memory have been transfered. On each side the appropriate unregistering and deallocation of ancillary memory takes place and the worker deserializes the memory into a task before proceeding on to the first iteration of k-Means algorithm.

As an alternative to using direct RDMA writes, a design based on Accelio messaging was considered. In this design the client allocates memory for the serialized task and issues an allocation request to the server. The server services the request and the contents of memory are transfered in 8 KiB blocks before the exchange of messages is no longer necessary. While this approach requires fewer Accelio API calls to coordinate, it is significantly slower than the more involved direct RDMA based approach.

#### Iteration and aggregation

The k-Means algorithm is implemented as both a sequential and parallel multi-core kernel that can be selected at compile time. The sequential version is identical to what was discussed in the background, whereas the parallel kernel is a simplified version of the distributed system less data transfer and fault-tolerance enhancements. Data is partitioned onto threads equaling the total number of cores on the system. Each iteration threads are created to run the sequential kernel on each thread’s partition of the data. Once all the threads complete, they are joined back to the main thread where their partial results are aggregated by the main thread and used for the distributed synchronization.

#### Synchronization

In this distributed setting, the barrier must be able to accommodate the fluctuating presence of workers due to failures. Multiple designs were considered, but the all-to-all paradigm was chosen for its redundancy, local view of synchronization, and allows the coordinator to fail. The scheme may not scale well and future work would need to investigate alternative methods. Unlike the data transfer section, plain TCP sockets are used since the quantities of data being shared are significantly smaller.

A few assumptions need to be stated before explaining the protocol. First, the coordinator is allowed to be a single point of failure for recovery and failures for all participants are assumed to be network oriented up to network partitioning. Network partitions can operate independently, but cannot be reunified into a single partition. All other hardware resources are assumed to be reliable. Finally, the partition associated with a lost worker is not reassigned to another worker. It is simply dropped from the computations under the assumption that losing that partition will not significantly deteriorate the quality of the results up to some tolerance as mentioned in the introduction.

The first step to supporting transient workers is to support process restart. When a worker receives a task from the coordinator, it will serialize a snapshot to disk, service the task, and then remove the snapshot. When a worker process is started, it will deserialize an existing snapshot, and verify that the task is ongoing with the coordinator before joining, or discard the stale snapshot before awaiting the next task from the coordinator.

Figure 2: Example in which a worker reintegrates with the group after being offline for several iterations. Red lines denote blocking. (Timeout queries not shown.)

The distributed barrier next maintains a list of active, inactive and recovered participants in the system. The first phase of synchronization is the notification phase in which each participant will issue identifying information, current iteration, and its partial results for the k-Means algorithm to all other active participants. Those participants that cannot be reached are moved from the active to inactive list.

The second phase is the waiting phase in which a listening thread accumulates notifications on two separate blocking unbounded queues in the background. The recovery queue is for notifications, and the results queue for sharing partial results. For each active participant the results queue will be dequeued, and for each inactive participant, the results queue will be peeked and dequeued if nonempty. Because the results queue is blocking, a timeout is used to allow a participant to verify that another participant hasn’t gone offline in the time it took the participant to notify the other and wait for its response. If the other participant has gone offline, then it is moved from the active to inactive list, otherwise the participant will continue to wait on the other.

Each partial result from the results queue will be placed into a solicited or unsolicited list based on if its origin is a participant that was previously notified. The coordinator will then locally examine the unsolicited list and place those zero iteration requests in the recovered list when it is in a nonzero iteration. Workers will examine the unsolicited list and discard any requests whose iteration does not match their own.

The recovery phase begins by an inactive worker coming back online and sending their results to the coordinator, and then waiting to receive the coordinators results and a current iteration notification. The next iteration, the coordinator will look at its recovered list and send the current iteration to the recovering worker, then wait until it receives a resynchronized notification. Upon receiving the current iteration notification, the recovering worker will then go and notify all the other workers in the cluster of its results, and wait for them to response before issuing a resynchronized notification to the coordinator. At which point the recovering worker is fully synchronized with the rest of the system. Once the coordinator receives this notification on its recovery queue, it will move the recovering worker off the inactive and recovery lists and on to the active list before notifying the other workers of its results.

Once the notification and waiting phase have completed, all participants are synchronized and may independently find the next set of centroids by aggregating their partial results from both solicited and unsolicited lists, and then begin the next iteration of the k-Means algorithm. This process will continue until convergence, or the maximum number of iterations has been reached.

### Experiments

Experiments were conducted on four virtual machines running Red Hat 4.8 with 8 GiB of RAM and single tri-core Intel Xeon E312xx class 2.6 GHz processor. Host machines are equipped with Mellanox Connect X-3 Pro EN network interface cards supporting 40 Gbs RoCE. Reported times are wall time as measured by gettimeofday. All C++98 source code compiled using g++ without any optimization flags.

Spark runtime comparison ran on Amazon Web Services Elastic Map Reduce with four m1.large instances running dual core Xeon E5 2.0 GHz processor with 7.5 GiB of RAM supporting 10 Gbps Ethernet. Spark 1.5.2 comparison based on MLlib KMeans.train and reported time measured by System.currentTimeMillis. All Java 1.7 code compiled by javac and packaged by Maven.

#### Transfer Rates

 Figure 3: Transfer rates for variable block sizes up to 64 MiB and fixed 128 MiB payload. Figure 4: Transfer rates for fixed 64 MiB RDMA and 8 KiB message based block sizes and variable payloads up to 7 GiB.

Fig. (3) shows the influence of block size on transfer rate of a fixed payload for RDMA based transfers. As mentioned in the design section, Accelio only supports block sizes up to 64 MiB for RDMA transfers. When the block size is 8 KiB, it takes 117x longer to transfer the same payload than when a block size of 64 MiB is used. This performance is attributed to having to exchange fewer messages for the same volume of data. For the payload of 128 MiB considered, a peak transfer rate of 4.7 Gbps was obtained.

Fig. (4) looks at the relationship of a fixed 64 MiB block size for RDMA transfers up to 7 GiB before exhausting available system resources. A peak transfer rate of 7.7 Gbps is observed which is still significantly less than the 40 Gbps capacity of the network. This would suggest a few possibilities: the Mellanox X-3 hardware was configured incorrectly, there may be network switches limiting the transfer of data, or that there is still room for improvement in the Accelio library.

It is unclear why there should be a kink in performance at 2 GiB. Possible explanations considered the impacts of hardware virtualization, influence of memory distribution on the physical DRAM, and potential Accelio issues. Further experiments are needed to pinpoint the root cause.

To demonstrate the advanced performance of Accelio RDMA based transfers over Accelio messaging based transfers, Fig. (4) includes transfer performance based on 8 KiB messaging. For the values considered, the message based approach was 5.5x slower than the RDMA based approach. These results are a consequence of the number of messages being exchanged for the smaller block size and the overhead of using Accelio.

#### Recovery time

Figure 5: Example of Worker A going offline at iteration 29 and coming back online at iteration 69.

Fig. (5) demonstrates a four node system in which one worker departs and returns to the system forty iterations later. Of note is the seamless departure and reintegration into the system without inducing increased synchronization times for the other participants. For the four node system, average synchronization time was 16 ms. For recovering workers, average reintegrate time was 225 ms with high variance.

#### Runtime performance

Total Percentage
Sharing 721.9 6.7
Computing 8558.7 79.1
Synchronizing 1542.5 14.2
Unaccounted 6.2 0.1
Total 10820.4 100
Table 1: Time in milliseconds spent in each task for 100 iterations, $d = 2, K = 4, N = 10,000,000$.

Looking at the distribution of work, roughly 80% of the work is going towards actual computation that we care about and the remaining 20% what amounts to distributed system bookkeeping. Of that 20% the largest chunk is synchronization suggesting that a more efficient implementation is needed and that it may be worth abandoning sockets in favor of low latency, RDMA based transfers.

 Figure 6: Runtime for varying input based Accelio messaging and sequential kernel, and RDMA and parallel kernel. The latter typically being 2.5x faster Figure 7: Sequential, parallel, distributed versions of k-Means for varying input sizes with Spark runtime for reference.

Runtime of the system for varying inputs is shown in Fig. (6) based on its original Accelio messaging with sequential calculations, and final RDMA transfers with parallel calculations. The latter cuts runtime by 2.5x which isn’t terrible since each machine only has three cores.

The general runtime for different configurations for the final system is shown in Fig. (7). The sequential algorithm is well suited to process inputs less than ten thousand, parallel less than one million, and the distributed system for anything larger. In general, the system performed 40-50x faster than Spark on varying input sizes for a 9 core vs 8 core configuration. These are conservative estimates since the system does 100 fixed iterations, whereas and Spark’s MLlib $\text{k-Means} \lvert \lvert$ implementation stops as soon as possible (typically 5-10 iterations).

Figure 8: Speedup of the system relative to the sequential algorithm.

The overall speedup of the system relative to the sequential algorithm is shown in Fig. (8). For a 12 core system we observe at most a 7.3x runtime improvement (consistent with the performance and sequential vs. parallel breakdowns). In general, in the time it takes the distributed system to process 100 million entries, the sequential algorithm would only be able to process 13.7 million entries. Further optimization is desired.

### Discussion

#### Related Work

The established trend in distributed machine learning systems is to use general purpose platforms such as Google’s MapReduce and Apache’s Spark and Mahout. Low latency, distributed shared memory systems such as FaRM and Grappa are gaining traction as ways to provide better performance thanks to declining memory prices and greater RDMA adoption. The next phase of this procession is exemplified by GPUNet [7], representing systems built around GPUDirect RDMA, and will likely become the leading choice for enterprise customers seeking performance given the rise of deep learning applications on the GPU.

The design of this system is influenced by these trends with the overall parallelization of the k-Means algorithm resembling the map-reduce paradigm and RDMA transfers used here reflecting the trend of using HPC scale technologies in the data center. Given that this system was written in a lower level-language and specialized for the task, it isn’t surprising that it delivered better performance (40-50x) than the leading general purpose system (Apache Spark) written in higher level language.

That said, given the prominence of general purpose distributed machine learning systems, specialized systems for k-Means clustering are uncommon and typically designed for unique environments (e.g., wireless sensor networks [13]). In the cases where k-means is investigated, the focus is on approximation techniques that are statistically bounded [6], and efficient communication patterns [2]; both of these considerations could further be incorporated into this work to realize better performance.

For the barrier, most of the literature focuses on static lists of participants in shared [12] and distributed [4] multi-processor systems with an emphasis on designing solutions for specific networks (e.g., n-dimensional meshes, wormhole-routed networks, etc). For barriers that accommodate transient participants, the focus is on formal verification with [8] and [1] focused on shared and distributed systems respectively. No performance benchmarks could be found for a direct comparison to this work, however, [5] presents a RDMA over InfiniBand based dissemination barrier that is significantly faster ($\mu s$ vs $ms$) suggesting opportunities for future RDMA use.

#### Future Work

##### RDMA

Accelio provides a convenient API for orchestrating RDMA reads and writes at the expense of performance. Accelio serves a niche market and its maturity reflects that reality. There are opportunities to make Accelio more robust and capable for transferring large chunks of data. If this avenue of development is not fruitful, alternative libraries such as IBVerbs may be used in an attempt to obtain advertised transfer rates for RDMA enabled hardware.

As implemented, the distribution of tasks over RDMA is linear. This was done to take advantage of sequential read speeds from disk, but it may not scale well for larger clusters. Additional work could be done to look at pull based architectures where participants perform RDMA reads to obtain their tasks rather than the existing push based architecture built around RDMA writes. As well as exploring these paradigms for distributed file systems to accommodate larger datasets.

Calculations presented were multi-core oriented, but k-Means clustering can be done on the GPU as well. Technologies such as NVIDIA’s GPUDirect RDMA and AMD’s DirectGMA present an opportunity to accelerate calculations by minimizing expensive memory transfers between host and device in favor of between devices alone. Provided adequate hardware resources, this could deliver several orders magnitude faster runtime performance.

##### Kernel

As demonstrated in the experiments section, overall runtime is heavily dominated by the actual k-Means iteration suggesting refinements in the implementation will lead to appreciable performance gains. To this effect, additional work could be put into the sequential kernel to make better use of SIMD features of the underlying processor. Alternatively, the OpenBLAS library could be used to streamline the many linear algebra calculations since they already provides highly optimized SIMD features. Accelerators like GPUs, FPGAs, and MICs could be investigated to serve as alternative kernels for future iterations of the system.

##### Barrier

The barrier is by no means perfect and it leaves much to be desired. Beginning with allowing recovery of a failed coordinator, allowing reunion of network partitions, dynamic worker onboarding, and suspending calculations when too much of the system goes offline to ensure quality results. Once these enhancements are incorporated in to the system, work can be done to integrate the underlying protocols away from a socket based paradigm to that of RDMA. In addition, formal verification of the protocol would guide its use into production and on to larger clusters. The system works reasonably well on a small cluster, but further work is needed to harden it for modern enterprise clusters.

##### Overall System

As alluded to in the background, shuffling of data could be added to the system so that end users do not have to do so beforehand. Similarly, more sophisticated scheduling routines could be investigated to ensure an even distribution of work on a system of machines with varying capabilities.

While the k-Means algorithm served as a piloting example, work could be done to further specialize the system to accommodate a class of unsupervised clustering algorithms that fit the general map-reduce paradigm. The end goal is to provide a plug-n-play system with a robust assortment of optimized routines that do not require expensive engineers to setup, exploit, and maintain as is the case for most existing platforms.

Alternatively, the future of the system could follow the evolution of other systems favoring a generalized framework that enable engineers to quickly distribute arbitrary algorithms. To set this system apart from others, the emphasis would be on providing an accelerator agnostic environment where user specified programs run on whatever accelerators (FPGA, GPU, MIC, multi-core, etc.) are present without having to write code specifically for those accelerators. Thus saving time and resources for enterprise customers. Examples of this paradigm are given by such libraries as CUDAfy.NET and Aparapi for translating C# and Java code to run on arbitrary GPUs.

#### Conclusion

This work described a cross-platform, fault-tolerant distributed system that leverages the open source library Accelio to move large volumes of data via RDMA. Transfer rates up to 7.7 Gbps out of the desired 40 Gbps were observed and it is assumed if flaws in the library were improved, that faster rates could be achieved. A synchronization protocol was discussed that supports transient workers in the coordinated calculation of k-Means centroids. The protocol works well for small clusters, offering 225 ms reintegration times without significantly affecting other participants in the system. Further work is warranted to harden the protocol for production use. Given the hardware resources available the distributed system was 7.3x out of the desired 12x faster than the sequential alternative. Compared to equivalent data loads, the system demonstrated 40-50x better runtime performance than Spark MLlib’s implementation of the algorithm suggesting the system is a competitive alternative for k-Means clustering.

### References

[1] Shivali Agarwal, Saurabh Joshi, and Rudrapatna K Shyamasundar. Distributed generalized dynamic barrier synchronization. In Distributed Computing and Networking, pages 143-154. Springer, 2011.

[2] Maria-Florina F Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, pages 1995-2003, 2013.

[3] Mike Estlick, Miriam Leeser, James Theiler, and John J Szymanski. Algorithmic transformations in the implementation of k-means clustering on recongurable hardware. In Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, pages 103-110. ACM, 2001.

[4] Torsten Hoeer, Torsten Mehlan, Frank Mietke, and Wolfgang Rehm. A survey of barrier algorithms for coarse grained supercomputers. 2004.

[5] Torsten Hoeer, Torsten Mehlan, Frank Mietke, and Wolfgang Rehm. Fast barrier synchronization for inniband/spl trade. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pages 7-pp. IEEE, 2006.

[6] Ruoming Jin, Anjan Goswami, and Gagan Agrawal. Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems, 10(1):17-40, 2006.

[7] Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Amir Wated, Emmett Witchel, and Mark Silberstein. Gpunet: Networking abstractions for gpu programs. In Proceedings of the International Conference on Operating Systems Design and Implementation, pages 6-8, 2014.

[8] Duy-Khanh Le, Wei-Ngan Chin, and Yong-Meng Teo. Verication of static and dynamic barrier synchronization using bounded permissions. In Formal Methods and Software Engineering, pages 231-248. Springer, 2013.

[9] Xiaobo Li and Zhixi Fang. Parallel clustering algorithms. Parallel Computing, 11(3):275-290, 1989.

[10] Stuart P Lloyd. Least squares quantization in pcm. Information Theory, IEEE Transactions on, 28(2):129-137, 1982.

[11] Mellanox Technologies Ltd. Accelio – the open source i/o, message, and rpc acceleration library.

[12] John M Mellor-Crummey and Michael L Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(1):21-65, 1991.

[13] Gabriele Oliva, Roberto Setola, and Christoforos N Hadjicostis. Distributed k-means algorithm. arXiv, 2013.

[14] Renato Recio, Bernard Metzler, Paul Culley, Je Hilland, and Dave Garcia. A remote direct memory access protocol specication. Technical report, 2007.

[15] Mario Zechner and Michael Granitzer. Accelerating k-means on the graphics processor via cuda. In Intensive Applications and Services, 2009. INTENSIVE’09. First International Conference on, pages 7-15. IEEE, 2009.

Written by lewellen

2015-12-22 at 1:43 pm

## Using GPUs to solve Spatial FitzHugh-Nagumo Equation Numerically

### Introduction

#### Modeling Action Potentials

In neurophysiology, we wish to model how electrical impulses travel along the complex structures of neurons. In particular, along the axon since it is the principle outbound channel of the neuron. Along the axon are voltage gated ion channels in which concentrations of potassium and sodium are exchanged. During depolarization, a fast influx of sodium causes the voltage to gradually increase. During repolarization, a slow outflux of potassium causes the voltage to decrease gradually. Because the potassium gates are slow to close, there is a negative voltage dip and recovery phase afterwards called hyperpolarization.

Attempts to provide a continuous time model of these phases began with work in the 1950s by Hodgkin and Huxley [HH52]. The duo formulated a four-dimensional nonlinear system of ordinary differential equations based on an electrical circuit model of giant squid axons. This landmark contribution was later simplified in the 1960s by FitzHugh [Fit61]. Here FitzHugh casted the problem in terms of Van der Pol oscillators. This reformulation allowed him to explain the qualitative behavior of the system in terms of a two-dimensional excitation/relaxation state-space model. The following year Nagumo [NAY62] extended these equations spatially to model the desired action potential propagation along an axon, and demonstrated its behavior on analog computers. This spatial model will be the focus of this work, and a more comprehensive account of these developments can be found in [Kee02].

#### FitzHugh-Nagumo Equation

The Spatial FitzHugh-Nagumo equation is a two-dimensional nonlinear reaction-diffusion system:

 $\displaystyle \frac{\partial }{\partial t} v = \frac{\partial^2 }{\partial^2 x} v + v \cdot (1 - v) \cdot (v - a) - w \qquad \frac{\partial }{\partial t} w = \epsilon \left( v - \gamma w \right )$ $\displaystyle v(x, 0) = v_0 \qquad w(x, 0) = 0 \qquad \frac{\partial}{\partial x} v(0, t) = 0 \qquad \frac{\partial}{\partial x} v(L, t) = 0$ (1)

Here $v(x, t)$ represents the action potential (voltage) along the axon, and $w(x, t)$ the recovery of the system. Constant $0 < a 0$ is absent from the literature, and $\epsilon > 0$ is inversely proportional to recovery time.

#### Mathematical Analysis

The system does not admit a general analytic solution. Mathematical analysis of the system is concerned with the existence and stability of traveling waves with [AK15] providing a thorough account of these aspects. Other analyses are concerned with the state-space model and the relationship of parameters with observed physiological behavior. In particular: self-excitatory, impulse trains, single traveling wavefronts and impulses, doubly traveling divergent wave impulses, and non-excitatory behavior that returns the system to a resting state. Since the FitzHugh-Nagumo equations are well understood, more recent literature is focused on higher dimensional and stochastic variants of the system which [Tuc13] discusses in detail.

#### Numerical Analysis

A survey of the literature revealed several numerical approaches to solving the FitzHugh-Nagumo equations consisting of a Finite Element method with Backward Differentiation Formulae [Ott10], and the Method of Lines [Kee02] approach. In this work, three approaches based on the Finite Difference method are investigated: an explicit scheme, an adaptive explicit scheme, and an implicit scheme; along with their associated sequential CPU bound and parallel GPU bound algorithms.

### Explicit Finite Difference Scheme

To study the basic properties of the FitzHugh-Nagumo equations an explicit scheme was devised using forward and central differences for the temporal and spatial derivatives respectively.

 $\displaystyle v^{j + 1}_{i} = \frac{ \Delta t }{ \Delta x^2 } \left ( v^{j}_{i+1} - 2 v^{j}_{i} + v^{j}_{i-1} \right ) + v^{j}_{i} + \Delta t \left ( v^{j}_{i} \cdot \left( 1 - v^{j}_{i} \right) \cdot \left( v^{j}_{i} - a \right ) - w^{j}_{i} \right )$ $\displaystyle w^{j+1}_{i} = \Delta t \epsilon \left( v^{j}_{i} - \gamma w^{j}_{i} \right) + w^{j}_{i}$ $\displaystyle v^{0}_{i} = v_0( i \Delta x, 0) \qquad w^{0}_{i} = 0 \qquad v^{j}_{0} = v^{j}_{1} \qquad v^{j}_{N-2} = v^{j}_{N-1}$ (2)

Truncation errors are linear in time, $\mathcal{O}(\Delta t)$, and quadratic, $\mathcal{O}(\Delta x^{2})$, in space. For stability, $2 \Delta t \le \Delta x^2$, implying the use of inconveniently small time steps, however in practice it seems rare to find any mention of $\Delta x < 0.5$ in the literature. Each time step can be computed sequentially or in parallel in linear time, $\mathcal{O}(n)$. However, there is a significant constant factor improvement delivered by parallel GPU implementations since modern GPUs allow millions of nodes to be updated in parallel giving near constant runtime performance. Further parallelization is possible by trivially distributing contiguous node sets to a combination of multiple machines and multiple GPUs.

#### Experiments

 Figure 1: Depiction of excitation and relaxation with a constant input on the tail end of the axon. Waves travel right-to-left. State of system shown for $t = 306.19$. Figure 2: Impulse at $t = 0$ grows axially before its peak collapses at $t = 37.85$ giving rise to two stable impulses that travel in opposite directions before vanishing at $t = 242.98$.

The explicit scheme was used to investigate the traveling wave and divergent wave behaviors of the FitzHugh-Nagumo equations. Fig. (1) demonstrate the application of a constant impulse, $v(t, L) = 1$, at the end of an unexcited axon, $v_0(x) = 0$, giving rise to oscillatory behavior. Fig. (2) shows a Gaussian impulse applied initially to the center of the axon, $v_0(x) = \exp{- \frac{(x - 100)^2}{10}}$. As the system evolves, the impulse collapses giving rise to two separate impulses that travel in opposite directions before dissipating at the boundaries returning the axon to a completely unexcited state. Both test cases are qualitatively consistent with the literature and share the following parameters:

 $\displaystyle a = 0.01 \qquad \gamma = 2 \qquad \epsilon = 0.01 \qquad N = 401 \qquad \Delta x = 0.5 \qquad 2 \Delta t < \Delta x^2$ (3)

#### Error Analysis

 Figure 3: Varying values of $\Delta x$ w.r.t $\Delta x = 0.000045$ at $t = 0.91$. $\Delta t$ fixed to 0.0098. Figure 4: Varying values of $\Delta t$ w.r.t $\Delta t = 0.00023$ at different values of $t$. $\Delta x$ fixed to 0.5.

Two experiments were ran to verify the suggested truncation errors. The numerical solution given by a sufficiently small step size serves as an analytic solution to $v(x, t)$, which is then used to evaluate how larger steps sizes deviate as measured by the root mean squared error. Fig. (3) looks at varying $\Delta x$ and shows that as the step size as halved, the resulting RMSE is quartered consistent with the expect quadratic truncation term of the scheme. Similarly, Fig. (4) looks at varying $\Delta t$ and shows that independent of $t$, halving the step size results in an equally reduced RMSE consistent with the expected linear truncation term of the scheme.

#### Runtime Performance

 Figure 5: Wall time for memory allocation, memory transfers, and core loop for both CPU and GPU. Figure 6: Wall time for just core loop for both CPU and GPU.

Comparison of sequential CPU and parallel GPU bound algorithms is based on the wall time taken to perform a single iteration to calculate $v^{j+1}$ and $w^{j+1}$. Reported figures are a function of spatial node quantity and the mean run time of 30 iterations. The main bottleneck to performance is transferring buffers between the host and device. Fig. (5) illustrates this effect. For the largest test case considered, $N = 2^{24}$, “GPU w/o tx” delivers 19x faster runtime over “GPU w/ tx” by not transferring intermediate results back to the host. Similarly, it delivers significantly faster runtime over the CPU by a factor of 73x. To better understand this performance boost, Fig. (6) looks at just the core loop execution time. For $N = 2^{24}$, the core loops account for only 3.7% and 14.9% of the execution time on the CPU and GPU respectively with the GPU delivering 18x better performance than the CPU. These percentages increase monotonically, and suggest that memory transfers will eventually be dominated by sufficiently large inputs on GPUs with an abundance of memory.

### Adaptive Explicit Finite Difference Scheme

While investigating the traveling wavefront solution of the system, numerical oscillations were observed as the simulation zeroed in on the steady state solution. To address this issue, an adaptive explicit scheme was devised. A survey of the literature suggested a family of moving grid methods to solve the FitzHugh-Nagumo system based on a Lagrangian formulation [VBSS90] and a method of lines formulation [Zwa11]. Here a heuristic is used to concentrate grid points where the most change takes place.

The formulation of the scheme is identical to the former section with second order spatial derivatives being approximated by Lagrange interpolating polynomials since the technique supports non-uniform grids. A three point, first order truncation error scheme is used. A five point, third order truncation error scheme was considered, but abandoned in favor of the empirically adequate three point scheme.

 $\displaystyle \frac{\partial^2 }{\partial x^2} f(x_1) \approx 2 \sum_{i=0}^2 y_i \prod_{\substack{j = 0 \\ j \neq i}}^2 \frac{1}{(x_i - x_j)} + \frac{f^{(4)}(\xi_x)}{6} \left [ \prod_{\substack{i = 0 \\ i \neq 1}}^{2} (x - x_i) + 2 \sum_{\substack{i=0 \\ i \neq 1}}^{2}(x-x_i) \right ]$ (4)

The first and second spatial derivatives given by the Lagrange interpolating polynomials are used to decide how much detail is needed in the domain in addition to the first temporal derivative given by finite differences. First, a coarse grid is laid down across the entire domain to minimize the expected distance between nodes since the second order derivative has first order truncation error. Next, the magnitude of the first order spatial derivative (second order truncation error) is used to lay down a finer grid when the derivative is greater than a specified threshold. This corresponds to where the waves of the solution are.

Next, the temporal derivative is used to lay down an even finer grid in those areas having an absolute change above a specified threshold. The change in time corresponds to the dynamics of the system, by adding detail in these areas, we can preserve the behavior of the system. Finally, the zeros of the second spatial derivative serve as indicators of inflection points in the solution. These correspond most closely to the location of the traveling wavefronts of the equation. Here, the most detail is provided around a fixed radius of the inflection points since the width of the wavefronts does not depend on parameterization.

Each iteration, the explicit scheme is evaluated on the grid from the previous iteration and those results are then used to perform the grid building scheme. To map the available solution values to the new grid, the points are linearly interpolated if a new grid point falls between two previous points, or mapped directly if there is a stationary grid point between the two iterations. The latter will be the more common case since all grid points are chosen from an underlying uniform grid specified by the user. Linear interpolation will only take place when extra grid points are included in an area.

#### Experiments

Figure 7: Top: Numerical oscillation of a centered Gaussian impulse with $\epsilon = 0$ and all other parameters the same as those given in previous section for $t = 60.11, 90.12, 120.11$. Bottom: Eliminated numerical oscillation based on the adaptive explicit scheme.

Fig. (7) is the motivating example for this scheme and demonstrates how numerical oscillations can be avoided by avoiding calculations in regions with stationary solutions. To demonstrate that the scheme works well for other test cases, the more interesting and dynamic divergent impulse test case is shown in Fig. (8). Here we can see that as time progresses, points are allocated to regions of the grid that are most influenced by the system’s dynamics without sacrificing quality. For the time steps shown, the adaptive scheme used 2-3x fewer nodes than the explicit scheme from the previous section.

Figure 8: Example of adaptive grid on the divergent impulse example for $t \in$ { 15.06, 30.13, 45.05, 60.12, 75.05, 90.11 }. Red lines given by the explicit scheme from the previous section, green dots given by adaptive explicit scheme.

### Implicit Finite Difference Scheme

An implicit Crank-Nicolson scheme is used in this section to solve the FitzHugh-Nagumo equations. To simplify the computations, $w^{j+1}$ is solved explicitly using a second order central difference scheme before solving $v^{j+1}$ leading to the following formulation:

 $\displaystyle \frac{v^{j+1}_{i}}{\Delta t} - \frac{1}{2} \left [ \frac{v^{j+1}_{i+1} - 2 v^{j+1}_{i} + v^{j+1}_{i-1}}{\Delta x^2} + v^{j+1}_{i} \left( 1 - v^{j+1}_{i} \right ) \left( v^{j+1}_{i} - a \right ) \right ]$ $\displaystyle \qquad = \frac{v^{j}_{i}}{\Delta t} + \frac{1}{2} \left [ \frac{v^{j}_{i+1} - 2 v^{j}_{i} + v^{j}_{i-1}}{\Delta x^2} + v^{j}_{i} \left( 1 - v^{j}_{i} \right ) \left( v^{j}_{i} - a \right ) \right ] + \frac{1}{2} \left [ - w^{j}_{i} - w^{j+1}_{i} \right ]$ $\displaystyle w^{j+1}_{i} = 2 \Delta t \epsilon \left( v^{j}_{i} - \gamma w^{j}_{i} \right) + w^{j-1}_{i}$ $\displaystyle v^{0}_{i} = v_0( i \Delta x, 0) \qquad w^{0}_{i} = 0 \qquad v^{j}_{0} = v^{j}_{1} \qquad v^{j}_{N-1} = v^{j}_{N-2}$ (5)

The truncation error for this scheme is $\mathcal{O}\left( \Delta t^2 + \Delta x^2 \right)$ with an improved, albeit still restrictive, requirement over the explicit scheme that $\Delta t \le \Delta x^2$. Based on this formulation, the right-hand side is completely known when $v^{j+1}$ is calculated. This gives a simpler expression to consider:

 $\displaystyle F(v^{j+1}) = LHS(v^{j+1}) - RHS = 0$ (6)

Newton’s method is used to solve the nonlinear function $F$ with an initial guess equal to the previous time step’s values, $\hat{v}^{0} = v^{n}$. To refine the estimate, the following system is solved iteratively until the magnitude of the refinement, $\delta v$, is less than a specified tolerance or machine epsilon.

 $\displaystyle \hat{v}^{n+1} = \hat{v}^{n} - J^{-}(\hat{v}^{n}) F(\hat{v}^{n}) \qquad J(\hat{v}^{n}) \delta v = F(\hat{v}^{n}) \qquad \hat{v}^{n+1} = \hat{v}^{n} - \delta v$ (7)

This formulation gives rise to a tridiagonal Jacobian matrix, $J$, where the first and last row are specified by no-flux boundary conditions, and constant $\alpha$ off-diagonal entries and variable $\beta_i$ diagonal entries given by the partial derivatives of each nodal equation.

$J = \begin{pmatrix} 1 & 1 & 0 & \cdots & & \\ \alpha & \beta_1 & \alpha & 0 & \cdots & \\ 0 & \alpha & \ddots & \alpha & 0 & \cdots \\ \cdots & 0 & \alpha & \ddots & \alpha & 0 \\ & \cdots & 0 & \alpha & \beta_{n-1} & \alpha \\ & & \cdots & 0 & 1 & 1 \\ \end{pmatrix}$
 $\displaystyle \alpha = \frac{\partial}{\partial v_{i \pm 1}} F = - \frac{1}{2} \left [ \frac{1}{\Delta x^2} \right ]$ (8)

 $\beta_i = \frac{\partial}{\partial v_{i}} F = \frac{1}{\Delta t} - \frac{1}{2} \left [ -\frac{2}{\Delta x^2} -3 {v_{i}}^2 + 2(1 + a) v_{i} - a \right ]$ (9)

This tridiagonal system can be solved sequentially in $\mathcal{O}(n)$ time using the Thomas algorithm or in $\mathcal{O}(n \log_2 n)$ time using the Cyclic Reduction algorithm of [Hoc65]. Cyclic Reduction begins with a forward phase in which all odd indexed unknowns are eliminated recursively until a $2 \times 2$ or $3 \times 3$ system remains that can be solved directly. Cyclic Reduction ends with a backward phase that walks up the recursion stack to solve for the previously eliminated odd indexed unknowns. Since the algorithm decouples unknowns at each recursion level of the forward and backwards phases, these reductions can be done in parallel in $\mathcal{O}(\log_2 n)$ time on the GPU assuming $n-$fold parallelism.

Further parallelism can be achieved by solving points explicitly along the domain, then using those results to create implicit subdomains that can be solved using either Thomas or Cyclic Reduction algorithms on a combination of multiple machines and multiple GPUs at the expense of additional communication and coordination.

#### Error Analysis

 Figure 9: Varying values of $\Delta x$ w.r.t $\Delta x = 0.012$ at $t = 0.51$. $\Delta t$ fixed to 0.000071. Figure 10: Varying values of $\Delta t$ w.r.t $\Delta t = 0.00025$ at different values of $t$. $\Delta x$ fixed to 0.52.

Evaluation of the spatial error revealed an unexpected linear behavior as shown in Fig. (9). As the spatial step is halved, the resulting error was expected to become quartered, instead it is halved. No clear explanation was discovered to account for this discrepancy. With respect to time, Fig. (10) shows that both the Thomas and Cyclic Reduction algorithms were quadratic as multiple points in time were evaluated. The Thomas algorithm produced aberrations as the step size increased eventually producing numerical instability, while the Cyclic Reduction algorithm was immune to this issue.

 Figure 11: Stability of implicit solvers. Figure 12: Convergence of implicit solvers.

In terms of stability, the implicit scheme is stable up to $\Delta t \le \Delta x^2$ depending on which tridiagonal solver is used, and for comparison, the explicit scheme is stable up to precisely $2 \Delta t \le \Delta x^2$ as shown in Fig. (11). The Thomas algorithm demonstrates a slightly weaker stability guarantee than the Cyclic Reduction algorithm becoming unstable around $\Delta t / \Delta x^2 \approx 0.9$ and $\Delta t / \Delta x^2 \approx 1.05$ respectively. In terms of convergence, the Thomas algorithm typically takes more iterations than Cyclic Reduction to obtain the same degree of accuracy as shown in Fig. (12). Taking the sequence of adjustments up to machine epsilon $(2^{-52})$, Thomas algorithm gives $\lVert \delta v^{n+1} \rVert < 0.037 \lVert \delta v^{n} \rVert$ suggesting q-linear convergence. Likewise, Cyclic Reduction gives $\lVert \delta v^{n+1} \rVert < 0.17 \lVert {\delta v^{n}}^2 \rVert$ suggesting q-quadratic convergence.

#### Runtime Performance

 Figure 13: Performance comparison of CPU and GPU bound Thomas and Cyclic Reduction algorithms. Figure 14: Performance comparison of Jacobian solvers.

Sequential Thomas and Cyclic Reduction routines perform equally well on CPU as shown in Fig. (13). The parallel Cyclic Reduction method did not demonstrate significant performance gains on the GPU. However, looking at just the time taken to solve the Jacobian each iteration (not including initialization or memory transfers), parallel Cyclic Reduction on the GPU was 4-5x faster than both sequential CPU solvers as shown in Fig. (14).

To explain the poor performance of Cyclic Reduction on the GPU, there are a number of different factors at play. The algorithm is susceptible to warp divergence due to the large number of conditionals that take place. Reliance on global memory access with varying strides contributes to slow performance since shared memory can’t be effectively utilized, and each iteration the adjustments are transferred from device to host to decide if Newton’s method should terminate. These different factors suggest alternative GPU implementations need to be investigated to address these different issues.

### Discussion

#### Work Environment

All results in this paper were based on code written in C#, and compiled using Microsoft Visual Studio Express with Release settings to run on a commodity class Intel Core i7-2360QM quad core processor. Open source CUDAfy.NET is used to run C# to CUDA translated code on a commodity class NVIDIA GeForce GT 525m having two streaming multiprocessors providing 96 CUDA cores.

#### Future Work

Numerically, additional work could be done on the adaptive explicit scheme. In preparing the scheme, a cubic spline-based approach was abandoned in favor of simpler approaches due to time pressures. It would be worthwhile to explore how to solve the system on a spline-based “grid”.

In addition, further work could be done to optimize the implementation of the parallel Cyclic Reduction algorithm on the GPU since it delivered disappointing runtime behavior compared to the sequential algorithm on the CPU. [CmWH14] mention several different optimizations to try, and I believe better global memory access will improve runtime at the expense of more complicated addressing. As an alternative to Cyclic Reduction, both [CmWH14] and [ZCO10] detail several different parallel tridiagonal solvers that could be deployed.

In terms of models considered, there are a number of different directions that could be pursued including higher dimensional [MC04], coupled [Cat14]], [Ril06], and stochastic variants [Tuc13] of the spatial FitzHugh-Nagumo equation. Coming from a probabilistic background, I would be interested in investing time in learning how to solve stochastic ordinary and partial differential equations.

#### Conclusion

Three finite difference schemes were evaluated. An explicit scheme shows great performance on both the CPU and GPU, but it is susceptible to numerical oscillations. To address this issue, an adaptive explicit scheme based on heuristics was devised and is able to eliminate these issues while requiring fewer nodes to produce results on-par with the explicit scheme. An implicit scheme was evaluated which demonstrated a principled, and robust solution for a variety of test cases and is the favored approach of the three evaluated.

### References

[AK15] Gianni Arioli and Hans Koch. Existence and stability of traveling pulse solutions of the fitzhugh-nagumo equation. Nonlinear Analysis: Theory, Methods and Applications, 113:51-70, 2015

[Cat14] Anna Cattani. Fitzhugh-nagumo equations with generalized diffusive coupling. Mathematical Biosciences and Engineering, 11(2):203-215, April 2014

[CmWH14] Li-Wen Chang and Wen mei W. Hwu. A guide for implementing tridiagonal solvers on gpus. In Numerical Computations with GPUs, pages 29-44. Springer International Publishing, 2014.

[Fit61] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445, 1961.

[HH52] Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500-544, 1952.

[Hoc65] R. W. Hockney. A fast direct solution of poisson’s equation using fourier analysis. J. ACM, 12(1):95-113, Jan 1965.

[Kee02] James P. Keener. Spatial modeling. In Computational Cell Biology,volume 20 of Interdisciplinary Applied Mathematics, pages 171-197. Springer New York, 2002.

[MC04] Maria Murillo and Xiao-Chuan Cai. A fully implicit parallel algorithm for simulating the non-linear electrical activity of the heart. Numerical linear algebra with applications, 11(2-3):261-277, 2004.

[NAY62] J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061-2070, Oct 1962.

[Ott10] Denny Otten. Mathematical models of reaction diffusion systems, their numerical solutions and the freezing method with comsol multiphysics. 2010.

[Ril06] Caroline Jane Riley. Reaction and diffusion on the sierpinkski gasket. PhD thesis, University of Manchester, 2006.

[Tuc13] Henry C. Tuckwell. Stochastic partial differential equations in neurobiology. Linear and nonlinear models for spiking neurons. In Stochastic Biomathematical Models, volume 2058 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2013.

[VBSS89] J. G. Verwer, J. G. Blom, and J. M. Sanz-Serna. An adaptive moving grid method for one dimensional systems of partial differential equations. Journal of Computational Physics, 82(2):454-486, 1989.

[ZCO10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the gpu. SIGPLAN Not., 45(5):127-136, Jan 2010.

[Zwa11] M. N. Zwarts. A test set for an adaptive moving grid pde solver with time-dependent adaptivity. Master’s thesis. Utrecht University, Utrecht, Netherlands, 2011.

Written by lewellen

2015-11-30 at 1:58 pm

## k-Means Clustering using CUDAfy.NET

### Introduction

I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised k-means clustering algorithm using CUDAfy.NET– a C# wrapper for doing parallel computation on CUDA-enabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikit-learn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.

### Algorithms

#### Sequential

The traditional algorithm attributed to [Stu82] begins as follows:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. do (until convergence)
1. For each point in data set:
1. labels[point] = Assign(point, centroids)
2. centroids = Aggregate(points, labels)
3. convergence = DetermineConvergence()
3. return centroids

Assign labels each point with the label of the nearest centroid, and Aggregate updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign routine. For each of the $N$ points we’ll compute the distance to each of the $K$ centroids and pick the centroid with the shortest distance that we’ll assign to the point. This is an example of the Nearest Neighbor Search problem. Linear search gives $\mathcal{O}( K N )$ which is preferable to using something like k-d trees which requires repeated superlinear construction and querying. Assuming Euclidean distance and points from $\mathbb{R}^d$, this gives time complexity $\mathcal{O}( d K N )$. The Aggregate routine will take $\mathcal{O}(d K N)$. Assuming convergence is guaranteed in $I$ iterations then the resulting complexity is $\mathcal{O}(d K N I)$ which lends to an effectively linear algorithm.

#### Parallel

[LiFa89] was among the first to study several different shared memory parallel algorithms for k-means clustering, and here I will be going with the following one:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. Partition $N$ points into $P$ equally sized sets
3. Run to completion threadId from 1 to $P$ as:
1. do (until convergence)
1. sum, count = zero($K * d$), zero($K$)
2. For each point in partition[threadId]:
1. label = Assign(point, centroids)
2. For each dim in point:
1. sum[$d$ * label + dim] += point[dim]
3. count[label] = count[label] + 1
3. if(barrier.Synchronize())
1. centroids = sum / count
2. convergence = DetermineConvergence()
4. return centroids

The parallel algorithm can be viewed as $P$ smaller instances of the sequential algorithm processing $N/P$ chunks of points in parallel. There are two main departures from the sequential approach 1) future centroid positions are accumulated and counted after each labeling and 2) each iteration of $P$ while loops are synchronized before continuing on to the next iteration using a barrier – a way of ensuring all threads wait for the last thread to arrive, then continue to wait as the last one enters the barrier, and exits allowing the other threads to exit.

In terms of time complexity, Assign remains unchanged at $\mathcal{O}(d K)$, and incrementing the sums and counts for the point’s label takes time $\mathcal{O}(d + 1)$. Thus for $N/P$ points, a single iteration of the loop gives $\mathcal{O}( N/P (d K + d + 1) )$ time. Given $P$ threads, the maximum time would be given by the thread that enters the barrier, and assuming at most $I$ iterations, then the overall complexity is $\mathcal{O}(d I ( N (K + 1) + K P + 1 ) / P)$. Which suggests we should see at most a $\mathcal{O}(K P / (K + 1))$ speedup over the sequential implementation for large values of $N$.

#### GPGPU

The earliest work I found on doing k-means clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. Partition $N$ into $B$ blocks such that each block contains no more than $T$ points
3. do (until convergence)
1. Initialize sums, counts to zero
2. Process blockId 1 to $B$, $SM$ at a time in parallel on the GPGPU:
1. Initialize blockSum, blockCounts to zero
3. label = Assign(points[blockId * $T$ + threadId], centroids)
4. For each dim in points[blockId * $T$ + threadId]:
1. atomic blockSum[label * pointDim + dim] += points[blockId * $T$ + threadId]
5. atomic blockCount[label] += 1
1. atomic sums += blockSum
2. atomic counts += blockCounts
3. centroids = sums / counts
4. convergence = DetermineConvergence()

The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that process a single “block” at a time. Here we assign no more than $T$ points to a block such that each point runs as a single thread to be executed on each of the CUDA cores of the Streaming Multiprocessor.

When a single block is executing we’ll initialize the running sum and count as we did in the parallel case, then request that the threads running synchronize, then proceed to calculate the label of the point assigned to the thread atomically update the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically copy those block level sum and counts over to the global sum and counts shared by all of the blocks.

Let’s figure out the time complexity. A single thread in a block being executed by a Streaming Multiprocessor takes time $\mathcal{O}( 2K + (3K + 1)d + 1 )$ assuming that all $T$ threads of the block execute in parallel, that there are $B$ blocks, and $S$ Streaming Multiprocessors, then the complexity becomes: $\mathcal{O}(B / S (2K + (3K + 1)d + 1) )$. Since $B = N / T$, and at most $I$ iterations can go by in parallel, we are left with $\mathcal{O}( I N (2K + (3K + 1)d + 1) / T S )$. So the expected speedup over the sequential algorithm should be $\mathcal{O}( d K T S / (2K + (3K + 1)d + 1) )$.

#### Expected performance

For large values of $N$, if we allow $K$ to be significantly larger than $d$, we should expect the parallel version to 8x faster than the sequential version and the GPGPU version to be 255x faster than the sequential version given that $P = 8, S = 2, T = 512$ for the given set of hardware that will be used to conduct tests. For $d$ to be significantly larger than $K$, then parallel is the same, and GPGPU version should be 340x faster than the sequential version. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching issues that are not incorporated (and difficult to incorporate) into the calculations.

### Implementations

I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikit-learn was configured for testing.

#### C++

The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high level orchestration code. Let’s begin with the barrier implementation. Since I’m running on Windows 7, I’m unable to use the convenient InitializeSynchronizationBarrier, EnterSynchronizationBarrier, and DeleteSynchronizationBarrier API calls beginning with Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:

// ----------------------------------------------------------------------------
// Synchronization utility functions
// ----------------------------------------------------------------------------

struct Barrier {
CONDITION_VARIABLE conditionVariable;
CRITICAL_SECTION criticalSection;
int atBarrier;
int expectedAtBarrier;
};

void deleteBarrier(Barrier* barrier) {
DeleteCriticalSection(&(barrier->criticalSection));
// No API for delete condition variable
}

void initializeBarrier(Barrier* barrier, int numThreads) {
barrier->atBarrier = 0;

InitializeConditionVariable(&(barrier->conditionVariable));
InitializeCriticalSection(&(barrier->criticalSection));
}

bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) {
bool lastToEnter = false;

EnterCriticalSection(&(barrier->criticalSection));

++(barrier->atBarrier);

if (barrier->atBarrier == barrier->expectedAtBarrier) {
barrier->atBarrier = 0;
lastToEnter = true;

func(data);

WakeAllConditionVariable(&(barrier->conditionVariable));
}
else {
SleepConditionVariableCS(&(barrier->conditionVariable), &(barrier->criticalSection), INFINITE);
}

LeaveCriticalSection(&(barrier->criticalSection));

return lastToEnter;
}


A Barrier struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.

When a thread arrives at the barrier (synchronizeBarrier) it requests the critical section before attempting to increment the atBarrier variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is invoked. The reason why LeaveCriticalSection is included outside the the if statement is because SleepConditionVariableCS will release the critical section before putting the thread to sleep, then reacquire the critical section when it awakes. I don’t like that behavior since its an unnecessary acquisition of the critical section and slows down the implementation.

There is a single allocation routine which performs a couple different rounds of error checking when calling calloc; first to check if the routine returned null, and second to see if it set a Windows error code that I could inspect from GetLastError. If either event is true, the application will terminate.

// ----------------------------------------------------------------------------
// Allocation utility functions
// ----------------------------------------------------------------------------

void* checkedCalloc(size_t count, size_t size) {
SetLastError(NO_ERROR);

void* result = calloc(count, size);
DWORD lastError = GetLastError();

if (result == NULL) {
fprintf(stdout, "Failed to allocate %d bytes. GetLastError() = %d.", size, lastError);
ExitProcess(EXIT_FAILURE);
}

if (result != NULL && lastError != NO_ERROR) {
fprintf(stdout, "Allocated %d bytes. GetLastError() = %d.", size, lastError);
ExitProcess(EXIT_FAILURE);
}

return result;
}


Now on to the core of the implementation. A series of structs are specified for those data that are shared (e.g., points, centroids, etc) among the threads, and those that are local to each thread (e.g., point boundaries, partial results).

// ----------------------------------------------------------------------------
// Parallel Implementation
// ----------------------------------------------------------------------------

struct LocalAssignData;

struct SharedAssignData {
Barrier barrier;
bool continueLoop;

int numPoints;
int pointDim;
int K;

double* points;
double* centroids;
int* labels;

int maxIter;
double change;
double pChange;

DWORD numProcessors;

LocalAssignData* local;
};

struct LocalAssignData {
SharedAssignData* shared;
int begin;
int end;

int* labelCount;
double* partialCentroids;
};


The assign method does exactly what was specified in the parallel algorithm section. It will iterate over the portion of points it is responsible for, compute their labels and its partial centroids (sum of points with label $k$, division done at aggregate step.).

void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) {
int* local = (int*)checkedCalloc(end - begin, sizeof(int));

int* localCount = (int*)checkedCalloc(K, sizeof(int));
double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double));

// Process a chunk of the array.
for (int point = begin; point < end; ++point) {
double optDist = INFINITY;
int optCentroid = -1;

for (int centroid = 0; centroid < K; ++centroid) {
double dist = 0.0;
for (int dim = 0; dim < pointDim; ++dim) {
double d = points[point * pointDim + dim] - centroids[centroid * pointDim + dim];
dist += d * d;
}

if (dist < optDist) {
optDist = dist;
optCentroid = centroid;
}
}

local[point - begin] = optCentroid;
++localCount[optCentroid];

for (int dim = 0; dim < pointDim; ++dim)
localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim];
}

memcpy(&label[begin], local, sizeof(int) * (end - begin));
free(local);

memcpy(labelCount, localCount, sizeof(int) * K);
free(localCount);

memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K);
free(localPartial);
}


One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using within the assign routine. This in particular was motivated after I read about false sharing where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU causing overall performance to degrade. For labelCount and partialCentroids they’re reallocated since I was concerned about data locality and wanted the three arrays to be relatively in the same neighborhood of memory. Speaking of which, memory coalescing is used for the points array so that point dimensions are adjacent in memory to take advantage of caching. Overall, a series of cache friendly optimizations.

The aggregate routine follows similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args->shared->local[t].partialCentroids and args->shared->local[t].labelCount[t]. Using these partial results all the routine to complete in $\mathcal{O}(P K d)$ time which assuming all of these parameters are significantly less than $N$, gives a constant time routine. Once the centroids have been updated, the change in their location is computed and used to determine convergence along with how many iterations have gone by. Here if more than 1,000 iterations have occurred or the relative change in position is less than some tolerance (0.1%) then the threads will terminate.

void aggregate(void * data) {
LocalAssignData* args = (LocalAssignData*)data;

int* assignmentCounts = (int*)checkedCalloc(args->shared->K, sizeof(int));
double* newCentroids = (double*)checkedCalloc(args->shared->K * args->shared->pointDim, sizeof(double));

// Compute the assignment counts from the work the threads did.
for (int t = 0; t < args->shared->numThreads; ++t)
for (int k = 0; k < args->shared->K; ++k)
assignmentCounts[k] += args->shared->local[t].labelCount[k];

// Compute the location of the new centroids based on the work that the
for (int t = 0; t < args->shared->numThreads; ++t)
for (int k = 0; k < args->shared->K; ++k)
for (int dim = 0; dim < args->shared->pointDim; ++dim)
newCentroids[k * args->shared->pointDim + dim] += args->shared->local[t].partialCentroids[k * args->shared->pointDim + dim];

for (int k = 0; k < args->shared->K; ++k)
for (int dim = 0; dim < args->shared->pointDim; ++dim)
newCentroids[k * args->shared->pointDim + dim] /= assignmentCounts[k];

// See by how much did the position of the centroids changed.
args->shared->change = 0.0;
for (int k = 0; k < args->shared->K; ++k)
for (int dim = 0; dim < args->shared->pointDim; ++dim) {
double d = args->shared->centroids[k * args->shared->pointDim + dim] - newCentroids[k * args->shared->pointDim + dim];
args->shared->change += d * d;
}

// Store the new centroid locations into the centroid output.
memcpy(args->shared->centroids, newCentroids, sizeof(double) * args->shared->pointDim * args->shared->K);

// Decide if the loop should continue or terminate. (max iterations
// exceeded, or relative change not exceeded.)
args->shared->continueLoop = args->shared->change > 0.001 * args->shared->pChange && --(args->shared->maxIter) > 0;

args->shared->pChange = args->shared->change;

free(assignmentCounts);
free(newCentroids);
}


Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.

DWORD WINAPI assignThread(LPVOID data) {
LocalAssignData* args = (LocalAssignData*)data;

while (args->shared->continueLoop) {
memset(args->labelCount, 0, sizeof(int) * args->shared->K);

// Assign points cluster labels
assign(args->shared->labels, args->begin, args->end, args->labelCount, args->shared->K, args->shared->points, args->shared->pointDim, args->shared->centroids, args->partialCentroids);

// Tell the last thread to enter here to aggreagate the data within a
// critical section
synchronizeBarrier(&(args->shared->barrier), aggregate, args);
};

return 0;
}


The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance will the remaining $P - 1$ instances will run on separate threads. The data is partitioned, then the threads are spawned using the CreateThread routine. I wish there was a Windows API that would allow me to simultaneously create $P$ threads with a specified array of arguments because CreateThread will automatically start the thread as soon as it’s created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.

void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) {
// Lookup and calculate all the threading related values.
SYSTEM_INFO systemInfo;
GetSystemInfo(&systemInfo);

DWORD numProcessors = systemInfo.dwNumberOfProcessors;
DWORD numThreads = numProcessors - 1;
DWORD pointsPerProcessor = numPoints / numProcessors;

// Prepare the shared arguments that will get passed to each thread.
SharedAssignData shared;
shared.numPoints = numPoints;
shared.pointDim = pointDim;
shared.K = K;
shared.points = points;

shared.continueLoop = true;
shared.maxIter = 1000;
shared.pChange = 0.0;
shared.change = 0.0;
shared.numProcessors = numProcessors;

initializeBarrier(&(shared.barrier), numProcessors);

shared.centroids = centroids;
for (int i = 0; i < K; ++i) {
int point = rand() % numPoints;
for (int dim = 0; dim < pointDim; ++dim)
shared.centroids[i * pointDim + dim] = points[point * pointDim + dim];
}

shared.labels = (int*)checkedCalloc(numPoints, sizeof(int));

LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData));
for (int i = 0; i < numProcessors; ++i) {
local[i].shared = &shared;
local[i].begin = i * pointsPerProcessor;
local[i].end = min((i + 1) * pointsPerProcessor, numPoints);
local[i].labelCount = (int*)checkedCalloc(K, sizeof(int));
local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double));
}

shared.local = local;

for (int i = 0; i < numThreads; ++i)

// Do work on this thread so that it's just not sitting here idle while the
// other threads are doing work.

// Clean up
for (int i = 0; i < numThreads; ++i)

for (int i = 0; i < numProcessors; ++i) {
free(local[i].labelCount);
free(local[i].partialCentroids);
}

free(local);

free(shared.labels);

deleteBarrier(&(shared.barrier));
}


#### C#

The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.

In the GPGPU paradigm there is a host and a device in which sequential operations take place on the host (ie. managed C# code) and parallel operations on the device (ie. CUDA code). To delineate between the two, the [Cudafy] method attribute is used on the static public method assign. The set of host operations are all within the Fit routine.

Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and motivated by Thomas Bradley’s Advanced CUDA Optimization workshop notes [pdf] that suggest at that regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) having shared memory that the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.

What is nice about the shared memory is that it is much faster than the global memory of the GPGPU. (cf. Using Shared Memory in CUDA C/C++) To make use of this fact the threads will rely on two arrays in shared memory: sum of the points and the count of those belonging to each centroid. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will then add the arrays in shared memory to those in the global memory using the appropriate atomic operations.

using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
using Cudafy.Atomics;
using System;

namespace CUDAfyTesting {
public class CUDAfyKMeans {
[Cudafy]
public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) {
// Unpack the const value array
int pointDim = constValues[0];
int K = constValues[1];
int numPoints = constValues[2];

// Ensure that the point is within the boundaries of the points
// array.
if (point >= numPoints)
return;

// Use two shared arrays since they are much faster than global
// memory. The shared arrays will be scoped to the block that this

// Accumulate the each point's dimension assigned to the k'th
// centroid. When K = 128 => pointDim = 2; when pointDim = 128
// => K = 2; Thus max(len(sharedSums)) = 256.
if (tId < K * pointDim)
sharedSums[tId] = 0.0f;

// Keep track of how many times the k'th centroid has been assigned
// to a point. max(K) = 128
if (tId < K)
sharedCounts[tId] = 0;

// Make sure all threads share the same shared state before doing
// any calculations.

// Find the optCentroid for point.
double optDist = double.PositiveInfinity;
int optCentroid = -1;

for (int centroid = 0; centroid < K; ++centroid) {
double dist = 0.0;
for (int dim = 0; dim < pointDim; ++dim) {
double d = centroids[centroid * pointDim + dim] - points[point * pointDim + dim];
dist += d * d;
}

if (dist < optDist) {
optDist = dist;
optCentroid = centroid;
}
}

// Add the point to the optCentroid sum
for (int dim = 0; dim < pointDim; ++dim)
// CUDA doesn't support double precision atomicAdd so cast down
// to float...

// Increment the optCentroid count

// Wait for all of the threads to complete populating the shared
// memory before storing the results back to global memory where
// the host can access the results.

// Have to do a lock on both of these since some other Streaming
// Multiprocessor could be running and attempting to update the
// values at the same time.

// Copy the shared sums to the output sums
if (tId == 0)
for (int i = 0; i < K * pointDim; ++i)

// Copy the shared counts to the output counts
if (tId == 0)
for (int i = 0; i < K; i++)
}


Before going on to the Fit method, let’s look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDA-enabled GPGPU. Within the CUDAfy.Translator namespace there are a handful of classes for decompiling the application into an abstract syntax tree using ICharpCode.Decompiler and Mono.Cecil, then converting the AST over to CUDA C via visitor pattern, next compiling the resulting CUDA C using NVIDIA’s NVCC compiler, and finally the compilation result is relayed back to the caller if there’s a problem; otherwise, a CudafyModule instance is returned, and the compiled CUDA C code it represents loaded up on the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy, CudaLanguage.RunTransformsAndGenerateCode, CUDAAstBuilder.GenerateCode, CUDAOutputVisitor and CudafyModule.Compile.)

        private CudafyModule cudafyModule;
private GPGPU gpgpu;
private GPGPUProperties properties;

public int PointDim { get; private set; }
public double[] Centroids { get; private set; }

public CUDAfyKMeans() {
cudafyModule = CudafyTranslator.Cudafy();

gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
properties = gpgpu.GetDeviceProperties(true);

}


The Fit method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory- essentially copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points arrays since its the largest resource, but did not see noticeable gains so I left it as is.

At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set method, then the CUDA code is invoked using the Launch routine with the specified block and grid dimensions and device pointers. One thing that the API does is return an array when you allocate or copy memory over to the device. Personally, an IntPtr seems more appropriate. Execution of the routine is very quick, where on some of my tests it took 1 to 4 ms to process 100,000 two dimensional points. Once the routine returns, memory from the device (sum and counts) is copied back over to the host which then does a quick operation to derive the new centroid locations and copy that memory over to the device for the next iteration.

        public void Fit(double[] points, int pointDim, int K) {
if (K <= 0)
throw new ArgumentOutOfRangeException("K", "Must be greater than zero.");

if (pointDim <= 0)
throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero.");

if (points.Length < pointDim)
throw new ArgumentOutOfRangeException("points", "Must have atleast pointDim entries.");

if (points.Length % pointDim != 0)
throw new ArgumentException("points.Length must be n * pointDim > 0.");

int numPoints = points.Length / pointDim;

// Figure out the partitioning of the data.
int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0);

dim3 blockSize = new dim3(threadsPerBlock, 1, 1);

dim3 gridSize = new dim3(
Math.Min(properties.MaxGridSize.x, numBlocks),
Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)),
1
);

int[] constValues = new int[] { pointDim, K, numPoints };
float[] assignmentSums = new float[pointDim * K];
int[] assignmentCount = new int[K];

// Initial centroid locations picked at random
Random prng = new Random();
double[] centroids = new double[K * pointDim];
for (int centroid = 0; centroid < K; centroid++) {
int point = prng.Next(points.Length / pointDim);
for (int dim = 0; dim < pointDim; dim++)
centroids[centroid * pointDim + dim] = points[point * pointDim + dim];
}

// These arrays are only read from on the GPU- they are never written
// on the GPU.
int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues);
double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids);
double[] devicePoints = gpgpu.CopyToDevice<double>(points);

// These arrays are written written to on the GPU.
float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums);
int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount);

// Set up main loop so that no more than maxIter iterations take
// place, and that a realative change less than 1% in centroid
// positions will terminate the loop.
int maxIter = 1000;
double change = 0.0, pChange = 0.0;

do {
pChange = change;

// Clear out the assignments, and assignment counts on the GPU.
gpgpu.Set(deviceSums);
gpgpu.Set(deviceCount);

// Lauch the GPU portion
gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount);

// Copy the results memory from the GPU over to the CPU.
gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums);
gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount);

// Compute the new centroid locations.
double[] newCentroids = new double[centroids.Length];
for (int centroid = 0; centroid < K; ++centroid)
for (int dim = 0; dim < pointDim; ++dim)
newCentroids[centroid * pointDim + dim] = assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid];

// Calculate how much the centroids have changed to decide
// whether or not to terminate the loop.
change = 0.0;
for (int centroid = 0; centroid < K; ++centroid)
for (int dim = 0; dim < pointDim; ++dim) {
double d = newCentroids[centroid * pointDim + dim] - centroids[centroid * pointDim + dim];
change += d * d;
}

// Update centroid locations on CPU & GPU
Array.Copy(newCentroids, centroids, newCentroids.Length);
deviceCentroids = gpgpu.CopyToDevice<double>(centroids);

} while (change > 0.01 * pChange && --maxIter > 0);

gpgpu.FreeAll();

this.Centroids = centroids;
this.PointDim = pointDim;
}
}
}


#### Python

I include the Python implementation for the sake of demonstrating how scikit-learn was invoked throughout the following experiments section.

model = KMeans(
n_clusters = numClusters,
init='random',
n_init = 1,
max_iter = 1000,
tol = 1e-3,
precompute_distances = False,
verbose = 0,
copy_x = False,
);

model.fit(X);    // X = (numPoints, pointDim) numpy array.


### Experimental Setup

All experiments where conducted on a laptop with an Intel Core i7-2630QM Processor and NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. C++ and C# implementations were developed and compiled by Microsoft Visual Studio Express 2013 for Desktop targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). Python implementation was developed and compiled using Eclipse Luna 4.4.1 targeting Python 2.7, scikit-learn 0.16.0, and numpy 1.9.1. All compilers use default arguments and no extra optimization flags.

For each test, each reported test point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is computed as the (wall) time taken to execute model.fit(points, pointDim, numClusters) where time is measured by: QueryPerformanceCounter in C++, System.Diagnostics.Stopwatch in C#, and time.clock in Python. Every test is based on a dataset having two natural clusters at .25 or -.25 in each dimension.

### Results

#### Varying point quantity

Both the C++ and C# sequential and parallel implementations outperform the Python scikit-learn implementations. However, the C++ sequential and parallel implementations outperforms their C# counterparts. Though the C++ sequential and parallel implementations are tied, as it seems the overhead associated with multithreading overrides any multithreaded performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but does outperform the C# sequential one as the number of points to cluster increases.

So what’s the deal with Python scikit-learn? Why is the parallel version so slow? Well, it turns out I misunderstood the nJobs parameter. I interpreted this to mean that process of clustering a single set of points would be done in parallel; however, it actually means that the number of simultaneous runs of the whole process will occur in parallel. I was tipped off to this when I noticed multiple python.exe fork processes being spun off which surprised me that someone would implement a parallel routine that way leading to a more thorough reading the scikit-learn documentation. There is parallelism going on with scikit-learn, just not the desired type. Taking that into account the linear one performs reasonably well for being a dynamically typed interpreted language.

#### Varying point dimension

The C++ and C# parallel implementations exhibit consistent improved run time over their sequential counterparts. In all cases the performance is better than scikit-learn’s. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why do we not better CUDAfy.NET performance? The performance we see is identical to the vary point quantity test. So on one hand it’s nice that increasing the point dimensions did not dramatically increase the run time, but ideally, the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device which is a relatively slow process. Since I’m short on time, this will have to be something I investigate in more detail in the future.

#### Varying cluster quantity

As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikit-learn implementation starts to show some competitive performance. The exciting news of course is that varying the cluster size finally reveals improved C# CUDAfy.NET run time. Now there is some curious behavior at the beginning of each plot. We get $\le 10 \text{ ms}$ performance for two clusters, then jump up into about $\le 100 \text{ ms}$ for four to eight clusters. Number of points and their dimension are held constant, but we allocate a few extra double’s for the cluster centroids. I believe this has to do with cache behavior. I’m assuming for fewer than four clusters everything that’s needed sits nicely in the fast L1 cache, and moving up to four and more clusters requires more exchanging of data between L1, L2, L3, and (slower) memory memory to the different cores of the Intel Core i7-2630QM processor I’m using. As before, I’ll need to do some more tests to verify that this is what is truly happening.

#### Language comparison

For the three tests considered, the C++ implementations gave the best run time performance on point quantity and point dimension tests while the C# CUDAfy.NET implementation gave the best performance on the cluster quantity test.

The C++ implementation could be made to run faster be preallocating memory in the same fashion that C# does. In C# when an application is first created a block of memory is allocated for the managed heap. As a result, allocation of reference types in C# is done by incrementing a pointer instead of doing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before executing the C# routines, while the same allocation takes place during the C++ routines. Hence, the C++ run times will have an overhead not present in the C# run times. Had I implemented memory allocation in C++ the same as it’s done in C#, then the C++ implementation would be undoubtedly even faster than the C# ones.

While using scikit-learn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance; frequently coming ten times slower than the other two implementations on the varying point quantity and dimension tests, but within tolerance on the vary cluster quantity tests.

### Future Work

The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.

I’d like to spend more time figuring out some of the high-performance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than a week or two I spent on this. In addition, I’d like to dig a little deeper into doing CUDA C directly rather than through the convenient CUDAfy.NET wrapper; as well as explore OpenMP and OpenCL to see how they compare from a development and performance-oriented view to CUDA.

Python and scikit-learn were used a baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter since Julia pitches itself as a high-performance solution, and is used for exploratory data analysis and prototyping machine learning systems.

While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I’d like to apply CUDAfy.NET to the expectation maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of k-means clustering, and it will be good to implement the more involved EM algorithm.

### Conclusions

Through this exercise, we can expect to see modest speedups over sequential implementations of about 2.62x and 11.69x in the C# parallel and GPGPU implementations respectively when attempting to find large numbers of clusters on low dimensional data. Fortunately the way you use k-means clustering is to find the cluster quantity that maximizes the Bayesian information criterion or Akaike information criterion which means running the vary centroid quantity test on real data. On the other hand, most machine learning data is of a high dimension so further testing (on a real data set) would be needed to verify it’s effectiveness in a production environment. Nonetheless, we’ve seen how parallel and GPGPU based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.

### Bibliography

[LiFa89] Li Xiaobo and Fang Zhixi, “Parallel clustering algorithms”, Parallel Computing, 1989, 11(3): pp.275-290.

[MaMi09] Mario Zechner, Michael Granitzer. “Accelerating K-Means on the Graphics Processor via CUDA.” First International Conference on Intensive Applications and Services, INTENSIVEâ€™09. pp. 7-15, 2009.

[Stu82] Stuart P. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28:129-137, 1982.

Written by lewellen

2015-09-01 at 8:00 am

Posted in Algorithms

## Notes from SIGGRAPH 2015

### Introduction

I recently flew out to Los Angeles to attend the 42nd International Conference and Exhibition on Computer Graphics and Interactive Techniques. SIGGRAPH‘s theme this year was the crossroads of discovery bringing it closer to its roots that began here in Boulder, Colorado back in 1974. For me it was a chance to dig a little deeper into Computer Graphics research following my recent studies and develop a better understanding of the industries pushing the domain forward. As with most posts on this site, this is a short reminder to myself, and hopefully gives others an idea of what they could expect if they went.

### Production Sessions

Disney – Pixar’s “Lava”: Moving Mountains was an informative production session detailing the process of bringing “Lava” to the screen. “Lava” is the story of Uku, a lonely volcano in search of love. As millions of years go by, he begins to lose hope as he recedes back into the ocean. But all is not lost. Uku finds renewed hope for love as newly formed volcano Lele rises to the surface. After the Pixar magicians reveal their secrets, technical details, and engrossing backstory, “Lava” becomes an even more enjoyable short film.

The presentation began with director James Murphy explaining his personal story inspiring the short before giving a live performance of the titular song. Colin Levy followed Murphy’s conceptualization, story boarding, and clay mockups with how the film would be framed for maximal emotional impact. Levy explain the exploratory process of filming the opening scene of the film to find the right combination of lenses, and flight paths based on real-world references to help illustrate the size and scale of Uku, the hopeless volcano.

Both Aaron Hartline and Austin Lee continued discussing the challenges of animating and rigging Uku, Lele, a pair of dolphins, birds, whales, and turtles (the last four representing young love, newly weds, established lives, and life long love). In particular, the different approaches for animating and rigging the facial features of Uku (eyelids, lips, checks, and so on) and how the teams iterated to find a balance between what the audience might expect from an anthropomorphic mountain and what they wanted to achieve as story tellers.

Perhaps the most interesting moment in the presentation was Dirk Van Gelder’s sneak peak of the enhancements the team made to Presto (Pixar’s in-house animation tool) to provide animators final render quality real-time feedback of their changes through a clever combination of Renderman-based final renders and OpenGL hardware texturing. Aside from the technical novelty, it’s a great example of time saving enhancements that make it easier for people to freely experiment and explore different approaches leading to better results.

The closing discussion by Byron Bashforth and Farhez Rayani on shading and lighting was informative and it was interesting to see how the procedural approaches were done to give Uku both a physically realistic and visually appealing biome consisting of different shaders, and static and procedural assets. Overall, a very interesting peak into the workflow of one of the most venerable studios in the industry.

### Birds of a Feather

Having worked in the healthcare space for a fair bit of time, I was attracted to meetings on Volume Rendering and Medical Visualization and HealthTech: Modeling, Interaction, Hardware, and Analysis to see what people have been working on and to get a glimpse of where things are heading.

Nicholas Polys of Virginia Tech and Michael Aratow (MD) (both chairs of the Web3D Consortium Medical Working Group) began the medical visualization discussion by going over common libraries such as VTK (The Visualization Toolkit) and Voreen (Volume Rendering Engine), before discussing general purpose analysis and visualization tools such as Paraview. Volume oriented applications such as Seg3D (volume segmentation tool), OsiriX (DICOM viewer) were covered and finally, tools for exploring biomolecular systems such as Chimera, VMD (Visual Molecular Dynamics) and PathSim (Epstein-Barr Virus exploration) were discussed giving the audience a good lay of the land. Brief bit of time was given to surgical training tools based on 3D technologies and haptic feedback (e.g. H3D).

These were all interesting applications and seeing how they all work using different types of human-machine interfaces (standard workstations, within CAVE environments, or even in virtual reality headsets and gloves) was eye opening. The second main theme of the discussion was on standardization when it comes to interoperability and reproducibility. There was a heavy push for X3D along with interoperability with DICOM. Like a lot of massive standards, DICOM has some wiggle room in it that leads to inconsistent implementations from vendors. That makes portability of data between disparate systems complicated (not to mention DICOM incorporates non-graphical metadata such as complex HL7). Suffice to say X3D is biting off a big chunk of work, and I think it will take some time for them to make progress in healthcare since it’s a fragmented industry that is not in the least bit technologically progressive.

One area I felt was absent during the discussion was how 3D graphics could be used to benefit everyday patients. There is a wealth of fMRI and ECoG data that patients could benefit from seeing in an accessible way- for example showing a patient a healthy baseline, then accentuating parts of their own data and explaining how those anomalies affect their well-being. If a component can be developed to deliver that functionality, then it can be incorporated into a patient portal alongside all other charts and information that providers have accumulated for the patient.

The HealthTech discussion was presented by Ramesh Raskar, and his graduate students and postdocs from the MIT Media Lab. They presented a number of low-cost, low-power diagnostic devices for retinal imaging and electroretinography, high-speed tomography, cellphone-based microscopy, skin perfusion photography, and dental imaging. Along with more social oriented technologies for identify safe streets to travel, and automatically discerning mental health from portraits. There were plenty of interesting applications being developed by the group, but it was more of a show and tell by the group than discussing the types of challenges beyond the scope of the work by MIT Media Lab (as impressive as they are). (For example, The fine work 3Shape A/S has done with fast scanning of teeth for digital dentistry.)

One thing that was discussed of key interest was Meddit a way for medical practitioners and researchers to define open problems to maturity, then presenting those challenges to computer scientists to work on and develop solutions. While the company name is uninspired, I think this is the right kind of collaboration platform for the “toolmaker” view of hardware engineers, computer scientists and software engineers as it identifies a real issue, presents an opportunity, and gives a pool of talented, bright people a way to make a difference. I am skeptical that it will take off (I think it would have more success as a niche community within an umbrella collaboration platform- i.e. Stack Exchange model), but the idea is sound and something people should get excited about.

### Real-Time Live!

The challenge of real-time graphics is very appealing to me and getting to see what different software studios are working on was a real treat. While there were several presentations and awards given during the two hour long event, three demos stood out to me. Balloon Burst given by Miles Macklin of NVIDIA, BabyX presented by Mark Sagar of University of Auckland, and award winner A Boy and His Kite demoed by Nick Penwarden of Epic Games.

Macklin’s demo was impressive in that it simulated more than 750,000 particles (250,000 by their solver Flex, and 512,000 for mist and droplets) and their paper [pdf] Fast Grid-Free Surface Tracking gave some technical background into how they achieved their results. Fluid simulation is something I’d like to spend some time exploring, obviously won’t be able to create something as technical as Macklin’s group, but would like to spend some time on Smoothed-Particle Hydrodynamics, and seeing NVIDIA’s work was a good motivation boost to explore the subject further on my own.

Perhaps the most unexpected entry in the series was Sagar’s BabyX. It was a fascinating assemblage of neural networks, real time graphics, natural language processing, computer vision, and image processing to create the ultimate “Sims” like character- a baby that could learn and invoke different emotional responses based on external stimuli. Real-time graphics were photorealistic, and seeing the modeling behind the system to emulate how the brain behaves in the presence of different dopamine levels (and how those levels correspond to things like Parkinson’s and schizophrenia) was impressive as well. Overall, a fantastic technical achievement and I look forward to following Sagar’s work as it continues to evolve.

My main interest in going to Real-Time Live! was to see Penwarden’s work on A Boy and His Kite. This impressive demo spanning hundred square miles inspired by the Isle of Skye really puts to shame my prior work in creating procedural environments. Nonetheless, it goes to show to far the medium can be pushed and how small the divide between real-time and film is becoming. Computer Graphics World published (July-August 2015) a very thorough technical overview [p. 40-48] of how Penwarden’s team produced the short, in addition to the features added to Unreal Engine 4 to make the demo shine.

### Wrap-up

There were many other things I explored that I won’t go into detail- namely the VR Village, Emerging Technologies, Research Posters, Exhibition, and Job Fair. I’m still quite skeptical that virtual reality (and to the same extent augmented reality) technologies will come into the mainstream; I think they’ll continue to be the subject of researchers, gaming enthusiasts, and industry solutions for automotive, and healthcare problems. One thing that was a bit of a disappointment was the Job Fair as there were barely any companies participating. Overall, a positive experience learning what other people are doing in the industry, and getting to see how research is being applied in a variety of different domains including automotive, entertainment, engineering, healthcare, and science.

Written by lewellen

2015-08-14 at 1:41 pm

Posted in Computer Graphics

Tagged with ,

## Algorithms for Procedurally Generated Environments

with one comment

### Introduction

I recently completed a graduate course in Computer Graphics that required us to demonstrate a significant understanding of OpenGL and general graphics techniques. Given the short amount of time to work with, I chose to work on creating a procedurally generated environment consisting of land, water, trees, a cabin, smoke, and flying insects. The following write-up explains my approach and the established algorithms that were used to create the different visual effects showcased in the video above.

### Terrain

There are a variety of different techniques for creating terrain. More complex ones rely on visualizing a three dimensional scalar field, while simpler ones visualize a two dimensional surface defined by a fixed image, or dynamically using a series of specially crafted functions or random behavior. I chose to take a fractal-based approach given by [FFC82]’s Midpoint Displacement algorithm. A two dimensional grid of size $(2^n + 1)^2$ (for $n > 1$) is allocated with each entry representing the height of the terrain at that row and column position. Beginning with the corners, the corresponding four midpoints, as well as singular center, are found and their height calculated by drawing from a uniform random variable whose support is given by the respective corners. The newly assigned values now form four separate squares which can be assigned values recursively as shown above. The Midpoint Displacement algorithm produces noisy surfaces that are jagged in appearance. To smooth out the results, a $3 \times 3$ Gaussian Filter is applied to the surface twice in order to produce a more natural, smooth looking surface.

Face normals can be found by taking the cross product of the forward differences in both the row and column directions, however this leads to a faceted looking surface. For a more natural appearance, vertex normals are calculated using central differences in the row and column direction. The cross product of these approximations then gives the approximate surface normal at each vertex to ensure proper lighting.

Texturing of the terrain is done by dynamically blending eight separate textures together based on terrain height as shown above. The process begins by loading in a base texture which is applied to all heights of the terrain. The next texture to apply is loaded, and an alpha mask is created and applied to the next texture based on random noise and a specific terrain height band. Blending of the masked texture and base texture is a function of the terrain’s height where the normalized height is passed through a logistic function to decide what portion of each texture should be used. The combined texture then serves as a new base texture and the process repeats until all textures have been blended together.

### Water

In order to generate realistic looking water, a number of different OpenGL abilities were employed to accurately capture a half dozen different water effects consisting of reflections, waves, ripples, lighting, and Fresnel effects. Compared to other elements of the project, water required the largest graphics effort to get right.

To obtain reflections, a three-pass rendering process is used. In the first pass, the scene is clipped and rendered to a frame buffer (with color and depth attachments) from above the water revealing only what is below the surface for the refraction effects. The second pass clips and renders the scene to a frame buffer (with only a color attachment) from below the water surface revealing only what is above the water for the reflection effects. The third pass then combines these buffers on the water surface through vertex and fragment shaders to give the desired appearance. To map the frame buffer renderings to the water surface, the clipped coordinates calculated in the vertex shader are converted to normalized device coordinates through perspective division in the fragment shader which allows one to map the $(u, v)$ coordinate of the texture as it should appear on the screen to coordinates of the water surface.

To create the appearance of water ripples, a normal map is sampled and the resulting time varying displacement is used to sample the reflection texture for the fragment’s reflection color. Next, a similar sampling of the normal map is done at a coarser level to emulate the specular lighting that would appear on the subtle water waves created by the vertex shader. Refraction ripples and caustic lighting are achieved by sampling from the normal map just as the surface ripples and specular lighting were. To make the water appear cloudy, the depth buffer from the refraction rendering is used in conjunction with water depth so that terrain deeper under water is less visible as it would be in real life.

To combine the reflection and refraction components, the Fresnel Effect is used. This effect causes the surface of the water to vary in appearance based on viewing angle. When the viewer’s gaze is shallow to the water surface, the water is dominated by the reflection component, while when gazing downward, the water is more transparent giving way to the refraction component. The final combination effect is to adjust the transparency of the texture near the shore so that shallower water reveals more of the underlying terrain.

### Flora

The scene consists of single cottonwood trees, but the underlying algorithm based on Stochastic Lindermayer Systems [Lin68] which can produce a large variety of flora as shown above. The idea is that an n-ary tree is created with geometric annotations consisting of length and radius, and relative position, yaw, and pitch to its parent node. Internal nodes of the tree are rendered as branches or stems, while leaf nodes as groups of leaves, flowers, fruits and so on. Depending on the type of plant one wishes to generate, these parameters take on different values, and the construction of the n-ary tree varies. For a palm tree, a linked list is created, whereas a flower may have several linked lists sharing a common head, and a bush may be a factorial tree with a regular pitch.

### Smoke

The primary algorithmic challenge of the project was to visualize smoke coming from the chimney of the cabin in the scene. To achieve this, a simple particle system was written in conjunction with [Bli82]’s Metaballs and [PT+90]’s Marching Tetrahedra algorithms. The tetrahedral variant was chosen since it easier to implement from scratch than [LC87]’s original Marching Cubes algorithm. The resulting smoke plumes produced from this chain of algorithms is shown above.

Particles possess position and velocity three dimensional components, and are added at a fixed interval to the system in batches with random initial values on the xy-plane and zero velocity terms. A uniform random vector field is created with random x, y components and fixed, positive z component. Euler’s Forward Method is applied to the system to update each particle’s position and velocity. Any particles that escape from the unit bounding cube are removed from the system. This process produces the desired Brownian paths that are typical of smoke particles. To visualize each particle, the Metaballs algorithm is used to create a potential field about each particle. The three dimensional grid is populated in linear time with respect to the number of particles by iterating over a fixed volume about each particle since there is no need to go outside of the fixed volume where the point’s potential field is surely zero.

The resulting scalar field from this process is passed along to the Marching Tetrahedra algorithm. The algorithm will inspect each volume of the grid in cubic time with respect to the grid edge size. The eight points of the volume are then assigned inside / outside labellings with those volumes completely inside or outside ignored. Those having mixed labellings contain a segment of the surface we wish to render. A single volume is segmented into 6 tetrahedra[1], with two tetrahedra facing each plane resulting in a common corner shared by all as shown above. Each tetrahedron then has sixteen cases to examine leading to different surfaces. Two of these cases are degenerate; all inside or all outside. The remaining fourteen cases can be reduced to two by symmetry as shown above. To ensure the surface is accurate, the surface vertices are found by linearly interpolating between inside / outside grid points.

Face normals can be computed directly from the resulting surface planes by taking the usual cross product. In order to calculate vertex normals, numerical differentiation is used to derive the gradient of the scalar field at the grid point using backward, central, and forward differences depending on availability. Taking the calculated normals at each grid point, the surface normals at the previously interpolated surface vertices are then the linear interpolation of the corresponding grid point normals. Given more time, I would have liked to put more time into surface tracking and related data structures to reduce the cubic surface generation process down to just those volumes that require a surface to be drawn.

### Butterflies

For the final stretch goal, the proposed static options were eschewed in favor of adding in a dynamic element that would help bring the scene to life without being obtrusive. As a result, a kaleidoscope of butterflies that meander through the scene were introduced.

Each butterfly follows a pursuit curve as it chases after an invisible particle following a random walk. Both butterfly and target are assigned positions drawn from a uniform random variable with unit support. At the beginning of a time step, the direction to where the particle will be at the next time step from where the butterfly is at the current time step is calculated, and the butterfly’s position is then incremented in that direction. Once the particle escapes the unit cube, or the butterfly catches the particle, then the particle is assigned a new position, and the game of cat and mouse continues.

Each time step the butterfly’s wings and body are rotated a slight amount by a fixed value, and by the Euler angles defined by its direction to give the correct appearance of flying. To add variety to the butterflies, each takes on one of three different appearances (Monarch, and Blue and White Morpho varieties) based on a fair dice roll. One flaw with the butterflies is that they do not take into account the positions of other objects in the scene and can be often seen flying into the ground, or through the cabin.

### Conclusions

This project discussed a large variety of topics related to introductory computer graphics, but did not cover other details that were developed including navigation, camera control, lighting, algorithms for constructing basic primitives, and the underlying design of the C++ program or implementation of the GLSL shaders. While most of the research applied to this project dates back nearly 35 years, the combination of techniques lends to a diverse and interesting virtual environment. Given more time, additional work could be done to expand the scene to include more procedurally generated plants, objects, and animals, as well as additional work done to make the existing elements look more photorealistic.

### References

[Bli82] James F. Blinn. A generalization of algebraic surface drawing. ACM Trans. Graph., 1(3):235-256, 1982.

[FFC82] Alain Fournier, Donald S. Fussell, and Loren C. Carpenter. Computer render of stochastic models. Commun. ACM, 25(6):371-384, 1982.

[LC87] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1987, pages 163-169, 1987.

[Lin68] Astrid Lindenmayer. Mathematical models for cellular interactions in development i. filaments with one-side inputs. Journal of theoretical biology, 18(3):280-299, 1968.

[PT+90] Bradley Payne, Arthur W Toga, et al. Surface mapping brain function on 3d models. Computer Graphics and Applications, IEEE, 10(5):33-41, 1990.

Written by lewellen

2015-07-02 at 5:19 pm

Posted in Computer Graphics

## Deep Learning for Automatic Speech Recognition

### Introduction

The problem of automatic speech recognition, and details of the traditional Hidden Markov Model and Gaussian Mixture Model hybrid architecture (HMM-GMM) for acoustic modeling are detailed in [JM08], but will be skipped here. Instead, the focus of this literature review is to discuss how [DYDA12] uses a context dependent Hidden Markov Model and Deep Neural Network hybrid architecture (CD-HMM-GMM) for acoustic modeling as it represents a significant improvement over the traditional HMM-GMM approach. This review will begin with motivation for the architecture, then go into detail the algorithms used for pre-training, and outline the algorithms used for training before concluding with how well the approach outperforms the standard HMM-GMM approach.

#### Architecture

To motivate their architecture, [DYDA12] rely on the standard noisy channel model for speech recognition presented in [JM08] where we wish to maximize the likelihood of a decoded word sequence given our input audio observations:

 $\displaystyle \hat{w} = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( w \lvert x \right ) = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( x \lvert w \right ) \mathbb{P} \left( w \right )$ (1)

Where $\mathbb{P} \left( w \right )$ and $\mathbb{P} \left( x \lvert w \right )$ represent the language and acoustic models respectively. [JM08] state that the language model can be computed via an N-gram model; [DYDA12] acknowledge using this approach, but focus their efforts into explaining their acoustic model:

 $\displaystyle \mathbb{P} \left( x \lvert w \right ) = \sum_{q} \mathbb{P} \left( x, q \lvert w \right ) \mathbb{P} \left( q \lvert w \right ) \approxeq \max \pi(q_0) \prod_{t = 1}^T a_{q_{t-1} q_t} \prod_{t=0}^T \mathbb{P} \left( x_t \lvert q_t \right )$ (2)

Here the acoustic model is viewed as a sequence of transitions between states of tied-state triphones which [DYDA12] refer to as senones giving us the context dependent aspect of the architecture. [FLMS14] explains that senones represent the pronunciation of words and are derived by decision trees. By tying triphone states together, this approach is able to avoid having to process a large number of triphones and avoid the likely sparseness of training examples for every possible triphone.

The model assumes that there is a probability $\pi(q_0)$ for the starting state, probabilities $a_{q_{t-1} q_{t}}$ of transitioning to the state observed at step $t -1$ to step $t$, and finally, the probability of the acoustics given the current state $q_t$. [DYDA12] expand this last term further into:

 $\displaystyle \mathbb{P} \left( x_t \lvert q_t \right ) = \frac{\mathbb{P} \left( q_t \lvert x_t \right ) \mathbb{P} \left( x_t \right ) }{\mathbb{P} \left( q_t \right ) }$ (3)

Where $\mathbb{P} \left( x_t \lvert q_t \right )$ models the tied triphone senone posterior given mel-frequency cepstral coefficients (MFCCs) based on 11 sampled frames of audio. While MFCCs come from signal processing, they have proven to be effective features for automatic speech recognition. Based on the power spectrum derived from sample audio frames, MFCCs represent characteristics of the audio that our ears are sensitive to as explained in [Ada10]. $\mathbb{P} \left( q_t \right )$ is the prior probability of the senone, and $\mathbb{P} \left( x_t \right )$ can be ignored since it does not vary based on the decoded word sequence we are trying to find.

Based on this formalism, [DYDA12] chose to use a pre-trained Deep Neural Network to estimate $\mathbb{P} \left( q_t \lvert x_t \right )$ using MFCCs as DNN inputs and taking the senone posterior probabilities as DNN outputs. The transitioning between events is best modeled by a Hidden Markov Model whose notation, $\pi, a, \text{and } q$ appears in Eq. (2). Now that we have an overview of the general CD-DNN-HMM architecture, we can look at how [DYDA12] train their model.

#### Pre-Training

Given the DNN model we wish to fit the parameters of the model to a training set. This is usually accomplished by minimizing a likelihood function and deploying a gradient descent procedure to update the weights. One complication to this approach is that the likelihood can be computationally expensive for multilayer networks with many nodes rendering the approach unusable. As an alternative, one can attempt to optimize a computationally tractable surrogate to the likelihood. In this case the surrogate is the contrastive divergence method developed by [Hin02]. This sidestep enabled [HOT06] to develop an efficient unsupervised greedy pre-training process whose results can then be refined using a few iterations of the traditional supervised backpropagation approach. In this portion of the paper we discuss the work of [Hin02] and explain the greedy algorithm of [HOT06] before going on to discuss the high-level training procedure of [DYDA12].

To understand the pre-training process, it is necessary to discuss the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) models. RBMs are an undirected bipartite graphical model with Gaussian distributed input nodes in a visible layer connecting to binary nodes in a hidden layer. Every possible arrangement of hidden, $h$, and visible, $v$, nodes is given an energy under the RBM model:

 $\displaystyle E(v, h) = - b^T v - c^T h - v^T W h$ (4)

Where $W$ is the weight of connections between nodes and vectors $b$ and $c$ correspond to the visible and hidden biases respectively. The resulting probability is then given by:

 $\displaystyle \mathbb{P} \left( v, h \right ) = \frac{e^{-E(v, h)}}{Z}$ (5)

Where $Z$ is a normalization factor. Based on the assumptions of the RBM, [DYDA12] derive expressions for $\mathbb{P} \left( h = 1 \lvert v \right )$ and $\mathbb{P} \left( v = 1 \lvert h \right )$ given by:

 $\displaystyle \mathbb{P} \left( h = 1 \lvert v \right ) = \sigma(c + v^T W) \qquad \mathbb{P} \left( v = 1 \lvert h \right ) = \sigma(b + h^T W^T)$ (6)

Where $\sigma$ is an element-wise logistic function. [DYDA12] argue that Eq. (6) allows one to repurpose the RBM parameters to initialize a neural network. Training of the RBM is done by stochastic gradient descent against the negative log likelihood since we wish to find a stable energy configuration for the model:

 $\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model}$ (7)

however [DYDA12] point out that the gradient of the negative log likelihood cannot be computed exactly since the $\langle \cdot \rangle_\text{model}$ term takes exponential time. As a result, the contrastive divergence method is used to approximate the derivative:

 $\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{1}$ (8)

where $\langle \cdot \rangle_\text{1}$ is a single step Gibbs sampled expectation. These terms are expectations in which nodes $i \text{ and } j$ are simultaneously active given the training data and model. Given this insight, regular stochastic gradient descent can be performed and the parameters of a RBM fitted to training data.

Now that we have an understanding of RBMs, we can shift our focus to DBNs. A Deep Belief Network is a multilayer model with undirected connections between the top two layers and directed between other layers. To train these models, [HOT06] had the insight to treat adjacent layers of nodes as RBMs. One starts with the bottom two layers and trains them as though they were a single RBM. Once those two layers are trained, then the top layer of the RBM is treated as the input layer of a new RBM with the layer above that layer acting as the hidden layer of the new RBM. The sliding window over the layers continues until the full DBN is trained. After this, [HOT06] describe an “up-down” algorithm to further refine the learned weights. The learned parameters of this greedy approach can then be used as the parameters of a DNN as explained earlier in the discussion of Eq. (6).

#### Training

Training of the CD-DNN-HMM model consists of roughly a dozen involved steps. We won’t elaborate here on the full details of each step, but will instead provide a high-level sketch of the procedure to convey its general mechanics.

The first high-level step of the procedure is to initialize the CD-DNN-HMM model. This is done by first training a decision tree to find the best tying of triphone states which are then used to train a CD-GMM-HMM system. Next, the unique tied state triphones are each assigned a unique senone identifier. This mapping will then be used to label each of the tied state triphones. (These identifiers will be used later to refine the DNN.) Finally, the trained CD-GMM-HMM is converted into a CD-DNN-HMM by retaining the triphone and senone structure and HMM parameters. This resulting DNN goes through the previously discussed pre-training procedure.

The next high-level step iteratively refines the CD-DNN-HMM. To do this, first the originally trained CD-GMM-HMM model is used to generate a raw alignment of states which is then mapped to its corresponding senone identifier. This resulting alignment is then used to refine the DBN by backpropagation. Next, the prior senone probability is estimated based on the number of frames paired with the senone and the total number of frames. These estimates are then used to refine the HMM transition probabilities to maximize the features. Finally, if this newly estimated parameters do not improve accuracy against a development set, then the training procedure terminates; otherwise, the procedure repeats this high-level step.

### Experimental Results

#### System Configurations

[DYDA12] report that their system relies on nationwide language model consisting of 1.5 million trigrams. For their acoustic model, they use a five hidden layer DNN with each layer containing 2,048 hidden units. Training the system from scratch on 24 hours of training data takes four days on a Dell T3500 workstation with an NVIDIA Tesla GPU. [DYDA12] emphasize the importance of the GPU in obtaining acceptable training time, and that without it, training time would be 30x slower.

#### Datasets and Metrics

Comparison of automatic speech recognition system consists of three principle error metrics: sentence (SER), word (WER), and phoneme (PER) error rates. These look at the ratio of incorrect entities to the number of total entities with the exception of word error rate which uses a Levenshtein approach to measure the number of insertions, substitutions, and deletions relative to the total number of words. A sentence is considered incorrect if there is at least one incorrect word.

These error metrics often coincide with different datasets, in particular WER is reported for Switchboard, SER for Bing Mobile Voice Search (BMVS), and PER on TIMIT. Switchboard is a collection of phone conversations between two people, while BMVS is a collection of short spoken questions such as “The Med” or “Chautauqua Park” that are used to find these locations, while TIMIT is a phonetic focused corpus of spoken sentences that are phonetically rich.

#### Results

 Switchboard BMVS TIMIT (WER) (SER) (PER) GMM 23.6[2] 36.2[1] 21.7[2] DNN 16.1[2] 30.4[1] 21.9[3] CNN – – 20.2[3] RNN – – 17.7[4]

Direct comparison of models is complicated by the variety of error metrics and datasets; [DBL12] is used to fill in these gaps to make a meaningful comparison. As one can see from Table (1), the neural network approaches do better on average over the traditional GMM approach. To illustrate that it is not only DNN approaches that do better, the work of [AMJ+14] using a Convolutional Neural Network (CNN) and [GMH13] using a Recurrent Neural Network (RNN) are included to further drive the point that neural network architectures are viable alternatives to GMMs.

### Conclusions

[DYDA12], [AMJ+14], and [GMH13] have shown that neural network architectures exhibit better performance over Gaussian Mixture Models. [DYDA12] believes that a more capable first layer model provided by mean-covariance restricted Boltzmann machines will increase performance, while [AMJ+14] plans to investigate unexpected improvements in large-vocabulary speech recognition where they were absent in phone recognition tasks when using convolutional restricted Boltzmann machines. Both routes seem promising and are likely to produce improved error rates inline with [GMH13]’s results.

In [DBL12], the authors of both research groups suggest key gains will come from improved understanding of the pre-training process and how the types of units used in these models affect error rates. They conclude that distributed training is the largest hurdle to overcome for these systems to make use of more training data. (Parallelization is limited by the sequential stochastic gradient descent at the heart of the pre-training and training processes.) As [DYDA12] point out in their paper, GPU-based approaches can assist in reducing computation time, but more foundational approaches need to be pursued.

In a 2014 talk [Hin14], Hinton criticizes existing neural network architectures on philosophical grounds arguing that they do not correspond well enough to how the brain functions citing inadequate structural complexity. His proposed solution is a new neural network approach that clusters neurons together into capsules, which he believes will better model how the cortical columns of the brain behave. If Hinton is right (which his track record suggests), then it is likely we’ll see this capsule approach outperform existing models, and consequently, yield improved error rates in automatic speech recognition.

### References

[Ada10] Andre Gustavo Adami. Automatic speech recognition: From the beginning to the portuguese language. In The Int. Conf. on Computational Processing of Portuguese (PROPOR). Rio Grande do Sul: Porto Alegre, 2010.

[AMJ+14] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Ui. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533-1545, 2014.

[DBL12] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research grounds. IEEE Signal Process. Mag., 29(6):82-97, 2012.

[DYDA12] George E. Dahl, Dong Ui, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, speech & Language Processing, 20(1):30-42, 2012.

[FLMS14] Luciana Ferrer, Yun Lei, Mitchell McLaren, and Nicolas Scheffer. Spoken language recognition based on senone posteriors. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communnication Association, Singapore, September 14-18, 2014, pages 2150-2154. ISCA, 2014.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton, Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645-6649, 2013.

[Hin02] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[Hin14] Geoffrey E. Hinton. What’s wrong with convolutional nets? Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Fall Colloquium Series, 2014.

[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.

Written by lewellen

2015-06-01 at 8:00 am

## Large-Scale Detection and Tracking of Aircraft from Satellite Images

Abstract
In this paper a distributed system for detecting and tracking commercial aircraft from medium resolution synthetic satellite images is presented. Discussion consisting of the system’s Apache Spark distributed architecture, correlation-based template matching, heuristic-based tracking, and evaluation of the system’s Standalone, Cluster, and Cloud modes of operation are covered before concluding with future work. Experimental results support 10 m detection accuracy, 96% detection rate, and 200 ms/image and 10 mb/sec Cloud processing performance. The goal of this work is to demonstrate that a satellite-based aircraft tracking system is feasible and that the system’s oversimplifying assumptions offer a baseline which similar real-world systems may be compared. Applications of this system include real-time tracking of commercial aircraft, flight deviation, and the automatic discovery of commercial aircraft from historic imagery. To the best of the author’s knowledge, a system of this nature has not been published publicly.

## Introduction

### Motivation

Historically, search and rescue operations have relied on on-site volunteers to expedite the recovery of missing aircraft. However, on-site volunteers are not always an option when the search area is very broad, or difficult to access or traverse as was the case when Malaysia Airlines Flight 370 disappeared over the Indian Ocean in March, 2014. Crowd-sourcing services such as tomnod.com have stepped up to the challenge by offering volunteers to manually inspect satellite images for missing aircraft in an attempt to expedite the recovery process. This process is effective, but slow. Given the urgency of these events, an automated image processing solution is needed.

### Prior Work

The idea of detecting and tracking objects from satellite images is not new. There is plenty of academic literature on detection or tracking, but often not both for things like oil tankers, aircraft, and even clouds. Most distributed image processing literature is focused on using grid, cloud, or specialized hardware for processing large streams of image data using specialized software or general platforms like Hadoop. Commercially, companies like DigitalGlobe have lots of satellite data, but haven’t publicized their processing frameworks. Start-ups like Planet Labs have computing and satellite resources capable of processing and providing whole earth coverage, but have not publicized any work on this problem.

The FAA has mandated a more down to earth approach through ADS-B. By 2020, all aircraft flying above 10,000 ft will be required to have Automatic dependent surveillance broadcast transceivers in order to broadcast their location, bearing, speed and identifying information so that they can be easily tracked. Adoption of the standard is increasing with sites like Flightrader24.com and Flightaware.org providing real-time access to ongoing flights.

### Problem Statement

Fundamentally, there are two variants of this problem: online and offline. The offline variant most closely resembles the tomnod.com paradigm which is concerned with processing historic satellite images. The online variant most closely resembles air traffic control systems where we’d be processing a continual stream of images coming from a satellite constellation providing whole earth coverage. The focus of this work is on the online variant since it presents a series of more interesting problems (of which the offline problem can be seen as a sub-problem.)

In both cases, we’d need to be able to process large volumes of satellite images. One complication is that large volumes of satellite images are difficult and expensive to come by. To work around this limitation, synthetic images will be generated based off OpenFlights.org data with the intent of evaluating natural images on the system in the future. To generate data we’ll pretend there are enough satellites in orbit to provide whole earth coverage capturing simulated flights over a fixed region of space and window of time. The following video is an example of the approximately 60,000 flights in the dataset being simulated to completion:

To detect all these aircraft from a world-wide image, we’ll use correlation based template matching. There are many ways to parallelize and distribute this operation, but an intuitive distributed processing of image patches will be done with each cluster node performing a parallelized Fast Fourier Transform to identify any aircraft in a given patch. Tracking will be done using an online heuristic algorithm to “connect the dots” recovered from detection. At the end of the simulation, these trails of dots will be paired with simulated routes to evaluate how well the system did.

The remainder of this post will cover the architecture of the system based on Apache Spark, its configurations for running locally and on Amazon Web Services, and how well it performs before concluding with possible future work and cost analysis of what it would take to turn this into a real-world system.

## System Architecture

### Overview

The system relies on the data pipeline architecture presented above. Data is read in from the OpenFlights.org dataset consisting of airport information and flight routes. A fixed number of national flights are selected and passed along to a simulation module. At each time step, the simulator identifies which airplanes to launch, update latitude and longitude coordinates, and remove those that have arrived at their destination.

To minimize the amount of network traffic being exchanged between nodes, flights are placed into buckets based on their current latitude and longitude. Buckets having flights are then processed in parallel by the Spark Workers. Each worker receives a bucket and generates a synthetic satellite image; this image is then given to the detection module which in turn recovers the coordinates from the image.

These coordinates are coalesced at the Spark Driver and passed along to the tracking module where the coordinates are appended to previously grown flight trails. Once 24 hours worth of simulated time has elapsed (in simulated 15 minute increments), the resulting tracking information is passed along to a reporting module which matches the simulated routes with the flight trails. These results are then visually inspected for quality.

### Simulation Assumptions

All latitude and longitude calculations are done under the Equirectangular projection. A corresponding flight exists for each route in the OpenFlights.org dataset (Open Database License). Flights departing hourly follow a straight line trajectory between destinations. Once en route, flights are assumed to be Boeing 747s traveling at altitude of 35,000 ft with a cruising speed of 575 mph.

### Generation

Flights are mapped to one of $4000 \times 8000$ buckets based on each flight’s latitude and longitude coordinate. Each bucket spans a $0.045 \times 0.045$ degree region as illustrated in the middle layer of Fig. (2). Given a bucket, a $512 \times 512$ oversimplified synthetic medium-resolution monochromatic satellite image is created with adorning aircraft silhouettes for each $71 \times 65$ m Boeing 747 airliner in the bucket. (Visual obstructions such as clouds or nightfall will not be depicted.) This image, in addition to the latitude and longitude of the top-left and bottom-right of the image, are then passed along to the detection module.

### Detection

Given an image and its world coordinate frame, the detection module performs textbook Fourier-based correlation template matching to identify silhouettes of airplanes, $X$, in the image, $Y$:

 $\displaystyle Z = \mathcal{F}^{-} \left( \mathcal{F}(X) \circ \mathcal{F}(Y) \right )$ (1)

Where the two-dimensional Discrete Fourier Transform and inverse transform are defined as:

 $\displaystyle \mathcal{F}(f)(u,v) = \frac{1}{M N} \sum_{m = 0}^{M - 1} \sum_{n = 0}^{N - 1} f(m, n) \exp{ \left( -2 \pi i \left( \frac{mu}{M} + \frac{nv}{N} \right ) \right ) }$ (2)

 $\displaystyle \mathcal{F^{-}}(F)(m,n) = \sum_{u = 0}^{M - 1} \sum_{v = 0}^{N - 1} F(u, v) \exp{ \left( 2 \pi i \left( \frac{mu}{M} + \frac{nv}{N} \right ) \right ) }$ (3)

To carry out these calculations efficiently, a parallelized two-dimensional Fast Fourier Transform (FFT) $\mathcal{O}(N \log N)$ time algorithm was implemented for both forward and inverse operations by leveraging the fact that operations (2), (3) can be factored so that the FFT can be computed row-wise, and then on those results column-wise. Hadamard (element-wise) product of the frequency domain representation of the airplane and satellite image is done naively in quadratic time.

To denoise the results, the recovered spatial product, $Z$, is thresholded so that any real values greater than 90% of the product’s real maximum, $Z^*$, are kept:

 $\displaystyle Z_{x,y} = \Re{ \left( Z_{x, y} \right ) } 1_{ [ \Re{ \left( Z_{x, y} \right ) } \ge 0.9 Z^{*} ] }$ (4)

Since there are many values greater than the threshold, a linear time (in number of nodes) connected component labeling algorithm is applied to identify the most likely aircraft locations. The algorithm treats each pixel of the image as a node if that pixel’s value is greater than the threshold. An edge exists between two nodes if the nodes’ pixel coordinates are within an $\ell_\infty$ distance of two. The centroid of each connected component is then taken to be the true coordinate of each detected aircraft. Next only those centroids derived from clusters having more than half the average number of pixels per cluster are kept. Finally, these centroids are transformed to latitude and longitude coordinates given the world coordinate frame.

### Tracking

The tracking module uses an $181 \times 361$ grid of buckets with each bucket representing approximately a square degree region as illustrated as the top layer in Fig. (2). Each individual bucket consists of a stack of sightings where a sighting is a timestamped collection of coordinates. Here an individual coordinate is allowed to “connect” up to two other coordinates. Coordinates connected in this fashion form a trail, which is the primary output of the module.

For each latitude and longitude coordinate from the detection module, $d \in D$, the tracking module picks all the previous time step’s coordinates, $p \in P$, from the neighboring ($\ell_\infty \le 5$) buckets of $d$‘s bucket. From $P$, only those coordinates that satisfy the following criteria are considered:

• $p$ must be free to “connect” to another coordinate.
• $d$ must be collinear to the coordinates of the trail headed by $p$, i.e., $\lvert AC - AB - BC \rvert \le \epsilon$ as in Fig. (3).
• Given the predecessor of $p$, the inner product of the vectors formed from the predecessor to $p$ and $d$ must be positive, i.e., $\langle \overrightarrow{AB}, \overrightarrow{AC} \rangle > 0$ as in Fig. (3).

Next, the nearest neighbor of $p$ is chosen from this remaining set of points. If a nearest neighbor exists, then $p$ is appended to the end of the nearest neighbor’s trail, otherwise a new trail is created. Finally, $p$ is added to its designated bucket so that it can be used for future trail building.

When the simulation completes, all trails from tracking module are analyzed and matched to the known routes used in the simulation. Matching is done by minimizing the distance from the trail’s origin and destination to a route’s origin and destination respectively. If the mean orthogonal distance:

 $\displaystyle MOD(x) = \frac{1}{N} \sum_{i} \frac{ \lvert \langle w, x_i \rangle + b \rvert }{ \lVert w \rVert }$ (5)

from the coordinates in the trail to the line formed by connecting the route’s origin and destination is greater than 25 m, then the match is rejected.

### Reporting

The reporting module is responsible for summarizing the system’s performance. The average mean orthogonal distance given by Eqn. (5) is reported for all identified routes, total number of images processed and coordinates detected, and the portion of routes correctly matched is reported.

## System Configurations

### Standalone

Standalone mode runs the application in a single JVM without using Spark. Experiments were ran on the quad-core Intel i7 3630QM laptop jaws, which has 8 GB of memory, 500 GB hard drive, and is running Windows 8.1 with Java SE 7.

### Cluster

Cluster mode runs the application on a Spark cluster in standalone mode. Experiments were ran on a network consisting of two laptop computers connected to a private 802.11n wireless network. In addition to jaws, the laptop oddjob was used. oddjob is a quad-core Intel i7 2630QM laptop with 6 GB of memory, 500 GB hard drive running Windows 7. Atop each machine, Oracle VM VirtualBox hosts two cloned Ubuntu 14.04 guest operating systems. Each virtual machine has two cores, 2 GB of memory and a 8 GB hard drive. Each virtual machine connects to the network using a bridged network adapter to its host’s. Host and guest operating systems are running Java SE 7, Apache Hadoop 2.6, and Scala 2.11.6 as prerequisites for Apache Spark 1.3.1. In total, there are four Spark Workers who report to a single Spark Master which is driven by a single Spark Driver.

### Cloud

Cloud mode runs the application on an Amazon Web Services (AWS) provisioned Spark cluster managed by Apache Yarn. Experiments were ran using AWS’s Elastic Map Reduce (EMR) platform to provision the maximum allowable twenty[1] Elastic Compute Cloud (EC2) previous generation m1.medium instances (one master, nineteen core) by scheduling jobs to execute the application JARs from the Simple Storage Service (S3). Each m1.medium instance consists of one 2 GHz CPU, 3.7 GB of memory, 3.9 GB hard drive running Amazon Machine Image (AMI) 3.6 equipped with Red Hat 4.8, Java SE 7, Spark 1.3.0. In total, there are nineteen Spark Workers and one Spark Master – one per virtual machine – managed by a Yarn Resource Manager driven by a single Yarn Client hosting the application.

## System Evaluation

### Detection Rate

 $\displaystyle D_R = \frac{\# \left( \text{Detected coordinates} \right)}{\# \left( \text{Expected coordinates} \right)}$ (6)

When an image is sparsely populated, the system consistently detects the presence of the aircraft, however as the density increases, the system is less adapt at finding all flights in the region as shown in Fig. (7). This is unexpected behavior. Explanations include the possibility that the threshold needs to be made to be adaptive, or that a different approach needs to be taken all together. In terms of real world implications, FAA regulations (JO 7110.65V 5-4-4) state that flights must maintain a minimum lateral distance of 3 and 5 miles (4.8 to 8 km). In principle, there could be at most four flights in a given image under these guidelines and the system would still have a 96.6% chance of identifying all four positions.

### Detection Accuracy

A flight from DIA to CTL was simulated to measure how accurate the template matching approach works as illustrated in Fig. (8). Two errors were measured: the mean orthogonal distance given by Eqn. (5) and the mean distance between the detected and actual coordinate for all time steps:

 $\displaystyle MD(x, y) = \frac{1}{N} \sum_i \lVert x_i - y_i \rVert_2$ (7)

For Eqn. (5) a mean error of $9.99 \pm 3.2$ m was found, and for Eqn. (7) $19.95 \pm 3.3$ m. Both errors are acceptable given a single pixel represents approximately $10$ m. (For context, the global positioning system (GPS) has a 7.8 m error.)

In terms of how accurate detection is at the macro level, 500 flights were simulated and the resulting mean orthogonal distance was analyzed. Fig. (9) illustrates the bimodal distribution that was observed. 65% of the flights observed an accuracy less than 26 m with an average accuracy of $14.2 \pm 5.8$ m, while the remaining 35% saw an average accuracy of $111.9 \pm 100$ km which is effectively being off by a full degree. It is assumed that this 35% are cases where trails are paired incorrectly with routes. Based on these findings, the system enforces that all pairings not exceed a mean orthogonal distance of 25 m.

### Tracking Rate

 $\displaystyle T_R = \frac{\# \left( \text{Correctly paired trails} \right)}{\# \left(\text{ Expected trails} \right)}$ (8)

For Fig. (10), an increasing number of random flights were simulated to completion and the resulting mean tracking rate reported. Based on these findings, the tracking module is having difficulty correctly handling many concurrent flights originating from different airports. This behavior is likely a byproduct of how quickly the detection rate degrades when many flights occupy a single region of space. When running a similar simulation where all flights originate from the same airport, the tracking rate is consistently near-perfect independent of the number of flights. This would suggest the system has difficulty with flights that cross paths. (For context, there is on average 7,000 concurrent flights over US airspace at any given time that we would wish to track.)

### Performance

A series of experiments was conducted against the three configurations to measure how quickly the system could process different volumes of flights across the United States over a 24-hours period. The results are illustrated in Fig. (11). Unsurprisingly, the Cloud mode outperforms both the Standalone and Cluster modes by a considerable factor as the number of flights increases.

Configuration ms/image mb/sec Time (min)
Standalone 704 3.00 260
Cluster 670 2.84 222
Cloud 207 9.67 76
Table 1: Processing time per image, megabytes of image data processed per second, and overall processing time for 22k images by system configuration.

Table (1) lists the overall processing time for 22k images representing roughly 550k km2, and 43 GB of image data. If the Cloud configuration was used to monitor the entire United States, then it would need approximately 22 hours to process a single snapshot consisting of 770 GB of image data. Obviously, the processing time is inadequate to keep up with a recurring avalanche of data every fifteen minutes. To do so, a real-world system would need to be capable of processing an image every 2 ms. To achieve this 1) more instances could be added, 2) the implementation can be refined to be more efficient, 3) the implementation can leverage GPUs for detection, and 4) a custom tailored alternative to Spark could be used.

## Discussion

### Future Work

There are many opportunities to exchange the underlying modules with more robust techniques that both scale and are able to handle real-world satellite images. The intake and generation modules can be adapted to either generate more realistic flight paths and resulting satellite imagery, or adapted to handle real-world satellite imagery from a vendor such as Skybox Imaging, Planet Labs, or DigitalGlobe.

For detection, the correlation based approach can be replaced with a cross-correlation approach, or with the more involved Scale Invariant Feature Transformation (SIFT) method which would be more robust at handling aircraft of different sizes and orientations. Alternatively, the parallelism granularity can be changed so that the two-dimensional FFT row-wise and column-wise operations are distributed over the cluster permitting the processing of larger images.

Tracking remains an open issue for this system. Getting the detection rate to be near perfect will go a long way, but the age of historical sightings considered could be increased to account for “gaps” in the detection trail. Yilmaz et al. provide an exhaustive survey of deterministic and statistical point tracking methods that can be applied here, in particular the Joint Probability Data Association Filter (JPDAF) and Multiple Hypothesis Tracking (MHT) methods which are worth exploring further.

On the reporting end of the system, a dashboard showing all of the detected trails and coordinates would provide an accessible user interface to end-users to view and analyze flight trails, discover last known locations, and detect anomalies.

### Real-world Feasibility

While the scope of this work has focused on system internals, it is important to recognize that a real-world system requires a supporting infrastructure of satellites, ground stations, computing resources, facilities and staff- each of which imposes its own set of limitations on the system. To evaluate the system’s feasibility, its expected cost is compared to the expected cost of the ADS-B approach.

Following the CubeSat model and a 1970 study by J. G. Walker, 25 satellites ($1M ea.) forming a constellation in low earth orbit is needed to provide continuous whole earth coverage for$25M. Ground stations ($120k ea.) can communicate with a satellite at a time bringing total costs to$50M.[2] Assuming that a single computer is responsible for square degree region, the system will require 64,800 virtual machines, equivalently 1,440 quad-core servers ($1k ea.) bringing the running total to$51M.

ADS-B costs are handled by aircraft owners. Average upgrade costs are $5k with prices varying by vendor and aircraft. Airports already have Universal Access Transceivers (UATs) to receive the ADS-B signals. FAA statistics list approximately 200,000 registered aircraft suggesting total cost of$1B.

Given that these are very rough estimates, an unobtrusive $51M system would be a good alternative to a$1B dollar exchange between private owners to ADS-B vendors. (Operational costs of the system were estimated to be \$1.7M/year based on market rates for co-locations and staff salaries.)

### Conclusions

In this work, a distributed system has been presented to detect and track commercial aircraft from synthetic satellite imagery. The system’s accuracy and detection rates are acceptable given established technologies. Given suitable hardware resources, it would be an effective tool in assisting search-and-rescue teams locate airplanes from historic satellite images. More work needs to be done to improve the system’s tracking abilities for it to be a viable real-world air traffic control system. While not implemented, the data needed to support flight deviation, flight collision detection and other air traffic control functionality is readily available. Spark is an effective tool for quickly distributing work to a cluster, but more specialized high performance computing approaches may yield better runtime performance.

## References

[1] Automatic dependent surveillance-broadcast (ads-b) out equipment and use. Technical Report 14 CFR 91.225, U.S. Department of Transportation Federal Aviation Administration, May 2010.

[2] General aviation and air taxi activity survey. Technical Report Table 1.2, U.S. Department of Transportation Federal Aviation Administration, Sep 2014.

[3] Nanosats are go! The Economist, June 7 2014.

[4] A. Eisenberg. Microsatellites: What big eyes they have. New York Times, August 10 2013.

[5] A. Fasih. Machine vision. Lecture given at Transportation Informatics Group, ALPEN-ADRIA University of Klagenfurt, 2009.

[6] S. Kang, K. Kim, and K. Lee. Tablet application for satellite image processing on cloud computing platform In Geoscience and Remote Sensing Symposium (IGARSS), 2013 IEEE International, pages 1710-1712, July 2013.

[7] W. Lee, Y. Choi, K. Shon, and J. Kim. Fast distributed and parallel pre-processing on massive satellite data using grid computing. Journal of Central South University, 21(10):3850-3855, 2014.

[8] J. Lewis. Fast normalized cross-correlation. In Vision interface, volume 10, pages 120-123, 1995.

[9] W. Li, S. Xiang, H. Wang, and C. Pan. Robust airplane detection in satellite images. In Image Processing (ICIP), 2011 18th IEEE International conference on, pages 2821-2824, Sept 2011.

[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91-110, 2004.

[11] H. Prajapati and S. Vij. Analytical study of parallel and distributed image processing. In Image Information Processing (ICIIP), 2011 International Conference on, pages 1-6, Nov 2011.

[12] E. L. Ray. Air traffic organization policy. Technical Report Order JO 7110.65V, U.S. Department of Transportation Federal Aviation Administration, April 2014.

[13] J. Tunaley. Algorithms for ship detection and tracking using satellite imagery. In Geoscience and Remote Sensing Symposium, 2004. IGARSS ’04. Proceedings. 2004 IEEE International, volume 3, pages 1804-1807, Sept 2004.

[14] J. G. Walker. Circular orbit patterns providing continuous whole earth coverage. Technical report, DTIC Document, 1970.

[15] Y. Yan and L. Huang. Large-scale image processing research cloud. In CLOUD COMPUTING 2014, The Fifth International Conference on Cloud Computing, GRIDs, and Virtualization, pages 88-93, 2014.

[16] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. Acm computing surveys (CSUR), 38(4):13, 2006.

Written by lewellen

2015-05-20 at 8:00 am