A Greedy Approximation Algorithm for the Linear Assignment Problem
Starting today, I will be posting some of the related source code for articles on GitHub.
Introduction
The Linear Assignment Problem (LAP) is concerned with uniquely matching an equal number of workers to tasks, , such that the overall cost of the pairings is minimized. A polynomial time algorithm was developed in the late fifties by [6], and further refined by [9], called the Hungarian method. Named so after the work of Hungarian mathematicians König and Egerváry whose theorems in the 1930s form the basis for the method. While the Hungarian Method can solve LAP instances in time, we wish to find faster algorithms even if it means sacrificing optimality in the process. Here we examine a greedy approximation algorithm with runtime in terms of its approximation factor and compare it empirically to the Hungarian method.
Linear Assignment Problem
The above linear program has cost, , and assignment, , matrices that specify the terms of the LAP. This is equivalent to finding a perfect matching in a weighted bipartite graph. A minimal cost may have several possible assignments, but we are only interested in finding just one. It is assumed that no one worker can do all jobs more efficiently by themselves than the distributing work across all workers. Likewise, if the costs are thought of as durations, then the minimum cost is the minimum sequential rather than parallel time taken to complete the tasks.
From a practical point of view, we may relax the integral constraint on and allow all positive realvalued costs. For instances where there are more jobs than workers, and vice versa, dummy entries valued greater than the existing maximum may be added. Minimizing the cost is the default objective, but the maximum cost can be found by finding the optimal assignment for , then finding the cost relative to .
Algorithms
Brute Force Rather than using the mathematical programming or graph theoretic representation of the problem, we can instead view the problem as finding the assignment that minimizes the cost out of all possible assignments:
There are such assignments which can be produced using an iterative version of Heap’s algorithm [5] in time assuming one does differential scoring (opposed to calculating the score for each permutation which would result in an algorithm.)
Random The random algorithm selects a permutation uniformly from the set of all possible assignment permutations in time using the FisherYates shuffle [4]. This obviously does not produce an optimal or nearoptimal solution, but serves as a straw man to compare other results.
Greedy The greedy heuristic continues to cover the row and column of the smallest uncovered entry in the cost matrix until all entries are covered. The resulting set of entries then constitutes the assignment of workers to jobs. An inefficient algorithm can be used to find the smallest entry every iteration, or a more efficient result of can be obtained through the use of a sorted, array indexed hybrid mesh and queue. Let represent a tuple consisting of row, column, and value; the previous entry in the matrix this value, and the next entry in this matrix this value; and the (left, above, right, below) that are adjacent to this node.
Algorithm 1 A greedy algorithm for the LAP.

 // Adjacent node left, above, right, below properties
 // Sort in ascending order by node value
 // Adjacent node previous and next properties

 // Deletes row and col of
Allocating and linking for assignment is ; mesh ; queue . Therefore, initialization requires time. The body of the loop requires a constant time assignment of worker to job, and time to remove the row and column from a matrix using a modified depth first search. Thus, the loop itself accounts for time. The resulting time complexity is therefore .
Breaking ties for the minimum uncovered value can result in different costs. This drawback is shown in the above example were choosing at yields a minimum cost of , where as the one at gives a minimum cost of . The next progression in the design of the greedy algorithm would be to try all minimum positions and keep the top performing paths.
Hungarian The general idea behind the KuhnMunkres algorithm is that if we are given an initial assignment, we can make further assignments and potentially reassign workers until all workers have been tasked with a job. The highlevel sketch of the algorithm starts with an initial assignment. While we have jobs that are unassigned, we look for qualified workers, ie, the zero entries. If a worker is already assigned to a job, but is also qualified for another, then we prime the alternative and continue to the next qualified worker, but if that is the only job the worker is qualified for, then we’d like to reassign any other worker already tasked to that job. This leads to a natural ripple effect represented by an alternating path of starred and primed entries. In Munkres’ paper [9] “starred” zero’s represent assignments of workers to jobs, and “primed” zero’s are alternative assignments. By flipping the bits of the path, we reassign workers to their alternative tasks while ensuring the assignment continues to be minimal by construction. After assigning as many workers as we have to, we then deduct the lowest cost to create a new qualified worker. Thus, every iteration we are guaranteed to make positive progress towards our goal of finding an optimal assignment. This scheme results in the worst case time to complete.
Algorithm 2 The Hungarian method for the LAP.

 Star the first uncovered zero in row , cover the corresponding column for
 All columns not covered
 Uncovered zeros
 Prime the current uncovered zero
 There’s a starred zero in this row
 Uncover the starred zero’s column and cover the row

 Find an alternating augmented path from the primed zero
 Unstar the starred zeros on the path and star the primed zeros on the path
 Remove all the prime markings and cover all stared zeros
 Found path

 over all uncovered
 for all uncovered columns
 for all covered rows
 Uncovered zeros
 Starred zeros // These are all the assignments
To further illustrate the algorithm, consider the following example where starred entries are denoted by red, and primed entries by green:
Analysis
The prevailing convention in the literature is to look at the approximation factor, , to determine how close the results of an approximation algorithm are to optimal [10]. Here this ratio is the expected minimum cost assignment of the algorithm under test to the same quantity given by the expected minimum assignment cost. Let be an a standard exponential random cost matrix. We resort to the exponential distribution for its ease of analyis and prominence in related literature. Cf. the works of [7], [8] for analysis based on .
Exponential Distribution Properties Let have cumulative distribution function and expectation . The distribution demonstrates the memoryless property for expectations . Define the order statistic to be the minimum of draws from . [2] with expectation . If then with expectation .
Expected Minimum Cost The expected minimum assignment cost for is given by [1]:
Which is the generalized harmonic number of order two and converges to . For the generalized harmonic numbers, , for .
Greedy The minimum value of an matrix is given by the order statistic with expectation . The expected value of the minimum cost assignment is not just because the expectation doesn’t take into account the previous iteration’s minimum value. To accomplish this we make use of the memoryless property of the exponential distribution to observe that the expected difference in minimums between iterations is the expected minimum value given by . If we add up all these differences we get the expected minimum value of the k’th iteration; summing all these expectations then yields the expected minimum cost assignment:
This is the harmonic number of order one which does not converge. The resulting approximation factor is:
Random The random algorithm will simply select an assignment permutation, so we are just adding up distributed random variables leading to an expected cost of:
And approximation factor:
From this analysis one concludes that the greedy algorithm has an unbounded approximation factor that grows significantly slower than that of randomly selecting assignments.
Evaluation
To illustrate the preceding results, Figure 1 shows the approximation factor for the greedy algorithm implementations against the derived approximation factor. The simulated results are based on 120 standard exponentially distributed matrices for . Using the same conventions for the approximation factor, Figure 2 illustrates the runtime characteristics of the algorithms after rejecting outliers due to system fluctuations. Results obtained from source code compiled with O3 flags and ran on a Xeon E31245 v5 3.5 Ghz system with 32 GBs of 2133Mhz DDR4 RAM. The algorithms coincide with the theoretical time complexities as shown in Table 2.
Solver  MSE 

GREEDYEFFICIENT  0.002139 
GREEDYNAIVE  0.014161 
HUNGARIAN  0.232998 
Summary
Brute  Random  Greedy  Hungarian  

Complexity  
1  1 
Exact solutions can be delivered by the brute method when a handful of workers are being considered, and the Hungarian method should be considered for all other instances. Approximate solutions can be provided by the greedy algorithm with logarithmic degeneracy while providing a linear factor improvement over the Hungarian method. For inputs greater than those considered, the parallel Auction algorithm [3] is a suitable alternative and the subject of future work.
References
 Aldous, D. J. The limit in the random assignment problem. Random Structures & Algorithms 18, 4 (2001), 381418.
 Balakrishnan, N., and Rao, C. Handbook of statistics 16: Order statisticstheory and methods, 2000.
 Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of operation research 4, 1 (1988), 105123.
 Durtenfeld, R. Algorithm 235: random permutation. Communications of the ACM 7, 7 (1964), 420.
 Heap, B. Permutations by interchanges. The Computer Journal 6, 3 (1963), 293298.
 Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly 2, 12 (1955), 83097.
 Kurtzberg, J. M. On approximation methods for the assignment problem. Journal of the ACM (JACM) 9, 4 (1962), 419439.
 Steele, M. J. Probability and statistics in the service of computer science: illustrations using the assignment problem. Communications in StatisticsTheory and Methods 19, 11 (1990), 43154329.
 Munkres, J. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 3238.
 Williamson, D. P., and Shmoys, D. B. The design of approximation algorithms. Cambridge university press, 2011.
End of the Grad School Era
If you find a path with no obstacles, it probably doesn’t lead anywhere. ~ Frank A. Clark
After a yearandahalf of intense work, I’m happy to announce that I’ve completed my Master of Science (M.S.) in Computer Science. It has been a unique experience full of challenges and opportunities that I did not fully anticipate when I quit my job as a Senior Software Developer last year. I met a diverse group of people at different points in their lives who all shared a common goal to seek out new knowledge and improve themselves; I among them. But it also meant studying something I’m passionate about, and I’m looking forward to putting into practice my new found machine learning knowledge.
When I announced going back to school I talked about making a jump; this idea that there was something better waiting for me out there. After being in the air for the past yearandahalf, I’ve landed on the west coast where I’ll be writing software related to artificial intelligence research. It’s a departure from the life I’ve built here in Colorado, but it’s also an opportunity to continue advancing my knowledge and cultivating expertise in a unique area; it is a jump to bigger and better things that I hope will be a rewarding experience.
It’s tough to say what opportunities I’ll get to work on side projects in the coming year, but I do hope to work on some miscellaneous projects related to clustering GPS trajectories; put some time into my fitness tracking application Viderefit; and put more time into GPUbased speech recognition system I’ve been steadily working on. As always, check out the archive for old posts, and subscribe by email (see side column), RSS, or Twitter (@antimatroidthe) for the latest happenings.
Using GPUs to solve Spatial FitzHughNagumo Equation Numerically
Introduction
Modeling Action Potentials
In neurophysiology, we wish to model how electrical impulses travel along the complex structures of neurons. In particular, along the axon since it is the principle outbound channel of the neuron. Along the axon are voltage gated ion channels in which concentrations of potassium and sodium are exchanged. During depolarization, a fast influx of sodium causes the voltage to gradually increase. During repolarization, a slow outflux of potassium causes the voltage to decrease gradually. Because the potassium gates are slow to close, there is a negative voltage dip and recovery phase afterwards called hyperpolarization.
Attempts to provide a continuous time model of these phases began with work in the 1950s by Hodgkin and Huxley [HH52]. The duo formulated a fourdimensional nonlinear system of ordinary differential equations based on an electrical circuit model of giant squid axons. This landmark contribution was later simplified in the 1960s by FitzHugh [Fit61]. Here FitzHugh casted the problem in terms of Van der Pol oscillators. This reformulation allowed him to explain the qualitative behavior of the system in terms of a twodimensional excitation/relaxation statespace model. The following year Nagumo [NAY62] extended these equations spatially to model the desired action potential propagation along an axon, and demonstrated its behavior on analog computers. This spatial model will be the focus of this work, and a more comprehensive account of these developments can be found in [Kee02].
FitzHughNagumo Equation
The Spatial FitzHughNagumo equation is a twodimensional nonlinear reactiondiffusion system:

(1) 
Here represents the action potential (voltage) along the axon, and the recovery of the system. Constant is absent from the literature, and is inversely proportional to recovery time.
Mathematical Analysis
The system does not admit a general analytic solution. Mathematical analysis of the system is concerned with the existence and stability of traveling waves with [AK15] providing a thorough account of these aspects. Other analyses are concerned with the statespace model and the relationship of parameters with observed physiological behavior. In particular: selfexcitatory, impulse trains, single traveling wavefronts and impulses, doubly traveling divergent wave impulses, and nonexcitatory behavior that returns the system to a resting state. Since the FitzHughNagumo equations are well understood, more recent literature is focused on higher dimensional and stochastic variants of the system which [Tuc13] discusses in detail.
Numerical Analysis
A survey of the literature revealed several numerical approaches to solving the FitzHughNagumo equations consisting of a Finite Element method with Backward Differentiation Formulae [Ott10], and the Method of Lines [Kee02] approach. In this work, three approaches based on the Finite Difference method are investigated: an explicit scheme, an adaptive explicit scheme, and an implicit scheme; along with their associated sequential CPU bound and parallel GPU bound algorithms.
Explicit Finite Difference Scheme
To study the basic properties of the FitzHughNagumo equations an explicit scheme was devised using forward and central differences for the temporal and spatial derivatives respectively.

(2) 
Truncation errors are linear in time, , and quadratic, , in space. For stability, , implying the use of inconveniently small time steps, however in practice it seems rare to find any mention of in the literature. Each time step can be computed sequentially or in parallel in linear time, . However, there is a significant constant factor improvement delivered by parallel GPU implementations since modern GPUs allow millions of nodes to be updated in parallel giving near constant runtime performance. Further parallelization is possible by trivially distributing contiguous node sets to a combination of multiple machines and multiple GPUs.
Experiments
The explicit scheme was used to investigate the traveling wave and divergent wave behaviors of the FitzHughNagumo equations. Fig. (1) demonstrate the application of a constant impulse, , at the end of an unexcited axon, , giving rise to oscillatory behavior. Fig. (2) shows a Gaussian impulse applied initially to the center of the axon, . As the system evolves, the impulse collapses giving rise to two separate impulses that travel in opposite directions before dissipating at the boundaries returning the axon to a completely unexcited state. Both test cases are qualitatively consistent with the literature and share the following parameters:
(3) 
Error Analysis
Figure 3: Varying values of w.r.t at . fixed to 0.0098. 
Figure 4: Varying values of w.r.t at different values of . fixed to 0.5. 
Two experiments were ran to verify the suggested truncation errors. The numerical solution given by a sufficiently small step size serves as an analytic solution to , which is then used to evaluate how larger steps sizes deviate as measured by the root mean squared error. Fig. (3) looks at varying and shows that as the step size as halved, the resulting RMSE is quartered consistent with the expect quadratic truncation term of the scheme. Similarly, Fig. (4) looks at varying and shows that independent of , halving the step size results in an equally reduced RMSE consistent with the expected linear truncation term of the scheme.
Runtime Performance
Figure 5: Wall time for memory allocation, memory transfers, and core loop for both CPU and GPU. 
Figure 6: Wall time for just core loop for both CPU and GPU. 
Comparison of sequential CPU and parallel GPU bound algorithms is based on the wall time taken to perform a single iteration to calculate and . Reported figures are a function of spatial node quantity and the mean run time of 30 iterations. The main bottleneck to performance is transferring buffers between the host and device. Fig. (5) illustrates this effect. For the largest test case considered, , “GPU w/o tx” delivers 19x faster runtime over “GPU w/ tx” by not transferring intermediate results back to the host. Similarly, it delivers significantly faster runtime over the CPU by a factor of 73x. To better understand this performance boost, Fig. (6) looks at just the core loop execution time. For , the core loops account for only 3.7% and 14.9% of the execution time on the CPU and GPU respectively with the GPU delivering 18x better performance than the CPU. These percentages increase monotonically, and suggest that memory transfers will eventually be dominated by sufficiently large inputs on GPUs with an abundance of memory.
Adaptive Explicit Finite Difference Scheme
While investigating the traveling wavefront solution of the system, numerical oscillations were observed as the simulation zeroed in on the steady state solution. To address this issue, an adaptive explicit scheme was devised. A survey of the literature suggested a family of moving grid methods to solve the FitzHughNagumo system based on a Lagrangian formulation [VBSS90] and a method of lines formulation [Zwa11]. Here a heuristic is used to concentrate grid points where the most change takes place.
The formulation of the scheme is identical to the former section with second order spatial derivatives being approximated by Lagrange interpolating polynomials since the technique supports nonuniform grids. A three point, first order truncation error scheme is used. A five point, third order truncation error scheme was considered, but abandoned in favor of the empirically adequate three point scheme.
(4) 
The first and second spatial derivatives given by the Lagrange interpolating polynomials are used to decide how much detail is needed in the domain in addition to the first temporal derivative given by finite differences. First, a coarse grid is laid down across the entire domain to minimize the expected distance between nodes since the second order derivative has first order truncation error. Next, the magnitude of the first order spatial derivative (second order truncation error) is used to lay down a finer grid when the derivative is greater than a specified threshold. This corresponds to where the waves of the solution are.
Next, the temporal derivative is used to lay down an even finer grid in those areas having an absolute change above a specified threshold. The change in time corresponds to the dynamics of the system, by adding detail in these areas, we can preserve the behavior of the system. Finally, the zeros of the second spatial derivative serve as indicators of inflection points in the solution. These correspond most closely to the location of the traveling wavefronts of the equation. Here, the most detail is provided around a fixed radius of the inflection points since the width of the wavefronts does not depend on parameterization.
Each iteration, the explicit scheme is evaluated on the grid from the previous iteration and those results are then used to perform the grid building scheme. To map the available solution values to the new grid, the points are linearly interpolated if a new grid point falls between two previous points, or mapped directly if there is a stationary grid point between the two iterations. The latter will be the more common case since all grid points are chosen from an underlying uniform grid specified by the user. Linear interpolation will only take place when extra grid points are included in an area.
Experiments
Figure 7: Top: Numerical oscillation of a centered Gaussian impulse with and all other parameters the same as those given in previous section for . Bottom: Eliminated numerical oscillation based on the adaptive explicit scheme.
Fig. (7) is the motivating example for this scheme and demonstrates how numerical oscillations can be avoided by avoiding calculations in regions with stationary solutions. To demonstrate that the scheme works well for other test cases, the more interesting and dynamic divergent impulse test case is shown in Fig. (8). Here we can see that as time progresses, points are allocated to regions of the grid that are most influenced by the system’s dynamics without sacrificing quality. For the time steps shown, the adaptive scheme used 23x fewer nodes than the explicit scheme from the previous section.
Figure 8: Example of adaptive grid on the divergent impulse example for { 15.06, 30.13, 45.05, 60.12, 75.05, 90.11 }. Red lines given by the explicit scheme from the previous section, green dots given by adaptive explicit scheme.
Implicit Finite Difference Scheme
An implicit CrankNicolson scheme is used in this section to solve the FitzHughNagumo equations. To simplify the computations, is solved explicitly using a second order central difference scheme before solving leading to the following formulation:

(5) 
The truncation error for this scheme is with an improved, albeit still restrictive, requirement over the explicit scheme that . Based on this formulation, the righthand side is completely known when is calculated. This gives a simpler expression to consider:
(6) 
Newton’s method is used to solve the nonlinear function with an initial guess equal to the previous time step’s values, . To refine the estimate, the following system is solved iteratively until the magnitude of the refinement, , is less than a specified tolerance or machine epsilon.
(7) 
This formulation gives rise to a tridiagonal Jacobian matrix, , where the first and last row are specified by noflux boundary conditions, and constant offdiagonal entries and variable diagonal entries given by the partial derivatives of each nodal equation.

This tridiagonal system can be solved sequentially in time using the Thomas algorithm or in time using the Cyclic Reduction algorithm of [Hoc65]. Cyclic Reduction begins with a forward phase in which all odd indexed unknowns are eliminated recursively until a or system remains that can be solved directly. Cyclic Reduction ends with a backward phase that walks up the recursion stack to solve for the previously eliminated odd indexed unknowns. Since the algorithm decouples unknowns at each recursion level of the forward and backwards phases, these reductions can be done in parallel in time on the GPU assuming fold parallelism.
Further parallelism can be achieved by solving points explicitly along the domain, then using those results to create implicit subdomains that can be solved using either Thomas or Cyclic Reduction algorithms on a combination of multiple machines and multiple GPUs at the expense of additional communication and coordination.
Error Analysis
Figure 9: Varying values of w.r.t at . fixed to 0.000071. 
Figure 10: Varying values of w.r.t at different values of . fixed to 0.52. 
Evaluation of the spatial error revealed an unexpected linear behavior as shown in Fig. (9). As the spatial step is halved, the resulting error was expected to become quartered, instead it is halved. No clear explanation was discovered to account for this discrepancy. With respect to time, Fig. (10) shows that both the Thomas and Cyclic Reduction algorithms were quadratic as multiple points in time were evaluated. The Thomas algorithm produced aberrations as the step size increased eventually producing numerical instability, while the Cyclic Reduction algorithm was immune to this issue.
Figure 11: Stability of implicit solvers. 
Figure 12: Convergence of implicit solvers. 
In terms of stability, the implicit scheme is stable up to depending on which tridiagonal solver is used, and for comparison, the explicit scheme is stable up to precisely as shown in Fig. (11). The Thomas algorithm demonstrates a slightly weaker stability guarantee than the Cyclic Reduction algorithm becoming unstable around and respectively. In terms of convergence, the Thomas algorithm typically takes more iterations than Cyclic Reduction to obtain the same degree of accuracy as shown in Fig. (12). Taking the sequence of adjustments up to machine epsilon , Thomas algorithm gives suggesting qlinear convergence. Likewise, Cyclic Reduction gives suggesting qquadratic convergence.
Runtime Performance
Figure 13: Performance comparison of CPU and GPU bound Thomas and Cyclic Reduction algorithms. 
Figure 14: Performance comparison of Jacobian solvers. 
Sequential Thomas and Cyclic Reduction routines perform equally well on CPU as shown in Fig. (13). The parallel Cyclic Reduction method did not demonstrate significant performance gains on the GPU. However, looking at just the time taken to solve the Jacobian each iteration (not including initialization or memory transfers), parallel Cyclic Reduction on the GPU was 45x faster than both sequential CPU solvers as shown in Fig. (14).
To explain the poor performance of Cyclic Reduction on the GPU, there are a number of different factors at play. The algorithm is susceptible to warp divergence due to the large number of conditionals that take place. Reliance on global memory access with varying strides contributes to slow performance since shared memory can’t be effectively utilized, and each iteration the adjustments are transferred from device to host to decide if Newton’s method should terminate. These different factors suggest alternative GPU implementations need to be investigated to address these different issues.
Discussion
Work Environment
All results in this paper were based on code written in C#, and compiled using Microsoft Visual Studio Express with Release settings to run on a commodity class Intel Core i72360QM quad core processor. Open source CUDAfy.NET is used to run C# to CUDA translated code on a commodity class NVIDIA GeForce GT 525m having two streaming multiprocessors providing 96 CUDA cores.
Future Work
Numerically, additional work could be done on the adaptive explicit scheme. In preparing the scheme, a cubic splinebased approach was abandoned in favor of simpler approaches due to time pressures. It would be worthwhile to explore how to solve the system on a splinebased “grid”.
In addition, further work could be done to optimize the implementation of the parallel Cyclic Reduction algorithm on the GPU since it delivered disappointing runtime behavior compared to the sequential algorithm on the CPU. [CmWH14] mention several different optimizations to try, and I believe better global memory access will improve runtime at the expense of more complicated addressing. As an alternative to Cyclic Reduction, both [CmWH14] and [ZCO10] detail several different parallel tridiagonal solvers that could be deployed.
In terms of models considered, there are a number of different directions that could be pursued including higher dimensional [MC04], coupled [Cat14]], [Ril06], and stochastic variants [Tuc13] of the spatial FitzHughNagumo equation. Coming from a probabilistic background, I would be interested in investing time in learning how to solve stochastic ordinary and partial differential equations.
Conclusion
Three finite difference schemes were evaluated. An explicit scheme shows great performance on both the CPU and GPU, but it is susceptible to numerical oscillations. To address this issue, an adaptive explicit scheme based on heuristics was devised and is able to eliminate these issues while requiring fewer nodes to produce results onpar with the explicit scheme. An implicit scheme was evaluated which demonstrated a principled, and robust solution for a variety of test cases and is the favored approach of the three evaluated.
References
[AK15] Gianni Arioli and Hans Koch. Existence and stability of traveling pulse solutions of the fitzhughnagumo equation. Nonlinear Analysis: Theory, Methods and Applications, 113:5170, 2015
[Cat14] Anna Cattani. Fitzhughnagumo equations with generalized diffusive coupling. Mathematical Biosciences and Engineering, 11(2):203215, April 2014
[CmWH14] LiWen Chang and Wen mei W. Hwu. A guide for implementing tridiagonal solvers on gpus. In Numerical Computations with GPUs, pages 2944. Springer International Publishing, 2014.
[Fit61] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445, 1961.
[HH52] Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500544, 1952.
[Hoc65] R. W. Hockney. A fast direct solution of poisson’s equation using fourier analysis. J. ACM, 12(1):95113, Jan 1965.
[Kee02] James P. Keener. Spatial modeling. In Computational Cell Biology,volume 20 of Interdisciplinary Applied Mathematics, pages 171197. Springer New York, 2002.
[MC04] Maria Murillo and XiaoChuan Cai. A fully implicit parallel algorithm for simulating the nonlinear electrical activity of the heart. Numerical linear algebra with applications, 11(23):261277, 2004.
[NAY62] J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):20612070, Oct 1962.
[Ott10] Denny Otten. Mathematical models of reaction diffusion systems, their numerical solutions and the freezing method with comsol multiphysics. 2010.
[Ril06] Caroline Jane Riley. Reaction and diffusion on the sierpinkski gasket. PhD thesis, University of Manchester, 2006.
[Tuc13] Henry C. Tuckwell. Stochastic partial differential equations in neurobiology. Linear and nonlinear models for spiking neurons. In Stochastic Biomathematical Models, volume 2058 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2013.
[VBSS89] J. G. Verwer, J. G. Blom, and J. M. SanzSerna. An adaptive moving grid method for one dimensional systems of partial differential equations. Journal of Computational Physics, 82(2):454486, 1989.
[ZCO10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the gpu. SIGPLAN Not., 45(5):127136, Jan 2010.
[Zwa11] M. N. Zwarts. A test set for an adaptive moving grid pde solver with timedependent adaptivity. Master’s thesis. Utrecht University, Utrecht, Netherlands, 2011.
kMeans Clustering using CUDAfy.NET
Introduction
I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised kmeans clustering algorithm using CUDAfy.NET– a C# wrapper for doing parallel computation on CUDAenabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikitlearn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.
Algorithms
Sequential
The traditional algorithm attributed to [Stu82] begins as follows:
 Pick points at random as the starting centroid of each cluster.
 do (until convergence)
 For each point in data set:
 labels[point] = Assign(point, centroids)
 centroids = Aggregate(points, labels)
 convergence = DetermineConvergence()
 For each point in data set:
 return centroids
Assign
labels each point with the label of the nearest centroid, and Aggregate
updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign
routine. For each of the points we’ll compute the distance to each of the centroids and pick the centroid with the shortest distance that we’ll assign to the point. This is an example of the Nearest Neighbor Search problem. Linear search gives which is preferable to using something like kd trees which requires repeated superlinear construction and querying. Assuming Euclidean distance and points from , this gives time complexity . The Aggregate
routine will take . Assuming convergence is guaranteed in iterations then the resulting complexity is which lends to an effectively linear algorithm.
Parallel
[LiFa89] was among the first to study several different shared memory parallel algorithms for kmeans clustering, and here I will be going with the following one:
 Pick points at random as the starting centroid of each cluster.
 Partition points into equally sized sets
 Run to completion threadId from 1 to as:
 do (until convergence)
 sum, count = zero(), zero()
 For each point in partition[threadId]:
 label = Assign(point, centroids)
 For each dim in point:
 sum[ * label + dim] += point[dim]
 count[label] = count[label] + 1
 if(barrier.Synchronize())
 centroids = sum / count
 convergence = DetermineConvergence()
 do (until convergence)
 return centroids
The parallel algorithm can be viewed as smaller instances of the sequential algorithm processing chunks of points in parallel. There are two main departures from the sequential approach 1) future centroid positions are accumulated and counted after each labeling and 2) each iteration of while loops are synchronized before continuing on to the next iteration using a barrier – a way of ensuring all threads wait for the last thread to arrive, then continue to wait as the last one enters the barrier, and exits allowing the other threads to exit.
In terms of time complexity, Assign
remains unchanged at , and incrementing the sums and counts for the point’s label takes time . Thus for points, a single iteration of the loop gives time. Given threads, the maximum time would be given by the thread that enters the barrier, and assuming at most iterations, then the overall complexity is . Which suggests we should see at most a speedup over the sequential implementation for large values of .
GPGPU
The earliest work I found on doing kmeans clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:
 Pick points at random as the starting centroid of each cluster.
 Partition into blocks such that each block contains no more than points
 do (until convergence)
 Initialize sums, counts to zero
 Process blockId 1 to , at a time in parallel on the GPGPU:
 If threadId == 0
 Initialize blockSum, blockCounts to zero
 Synchronize Threads
 label = Assign(points[blockId * + threadId], centroids)
 For each dim in points[blockId * + threadId]:
 atomic blockSum[label * pointDim + dim] += points[blockId * + threadId]
 atomic blockCount[label] += 1
 Synchronize Threads
 If threadId == 0
 atomic sums += blockSum
 atomic counts += blockCounts
 If threadId == 0
 centroids = sums / counts
 convergence = DetermineConvergence()
The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that process a single “block” at a time. Here we assign no more than points to a block such that each point runs as a single thread to be executed on each of the CUDA cores of the Streaming Multiprocessor.
When a single block is executing we’ll initialize the running sum and count as we did in the parallel case, then request that the threads running synchronize, then proceed to calculate the label of the point assigned to the thread atomically update the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically copy those block level sum and counts over to the global sum and counts shared by all of the blocks.
Let’s figure out the time complexity. A single thread in a block being executed by a Streaming Multiprocessor takes time assuming that all threads of the block execute in parallel, that there are blocks, and Streaming Multiprocessors, then the complexity becomes: . Since , and at most iterations can go by in parallel, we are left with . So the expected speedup over the sequential algorithm should be .
Expected performance
For large values of , if we allow to be significantly larger than , we should expect the parallel version to 8x faster than the sequential version and the GPGPU version to be 255x faster than the sequential version given that for the given set of hardware that will be used to conduct tests. For to be significantly larger than , then parallel is the same, and GPGPU version should be 340x faster than the sequential version. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching issues that are not incorporated (and difficult to incorporate) into the calculations.
Implementations
I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikitlearn was configured for testing.
C++
The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high level orchestration code. Let’s begin with the barrier implementation. Since I’m running on Windows 7, I’m unable to use the convenient InitializeSynchronizationBarrier
, EnterSynchronizationBarrier
, and DeleteSynchronizationBarrier
API calls beginning with Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:
//  // Synchronization utility functions //  struct Barrier { CONDITION_VARIABLE conditionVariable; CRITICAL_SECTION criticalSection; int atBarrier; int expectedAtBarrier; }; void deleteBarrier(Barrier* barrier) { DeleteCriticalSection(&(barrier>criticalSection)); // No API for delete condition variable } void initializeBarrier(Barrier* barrier, int numThreads) { barrier>atBarrier = 0; barrier>expectedAtBarrier = numThreads; InitializeConditionVariable(&(barrier>conditionVariable)); InitializeCriticalSection(&(barrier>criticalSection)); } bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) { bool lastToEnter = false; EnterCriticalSection(&(barrier>criticalSection)); ++(barrier>atBarrier); if (barrier>atBarrier == barrier>expectedAtBarrier) { barrier>atBarrier = 0; lastToEnter = true; func(data); WakeAllConditionVariable(&(barrier>conditionVariable)); } else { SleepConditionVariableCS(&(barrier>conditionVariable), &(barrier>criticalSection), INFINITE); } LeaveCriticalSection(&(barrier>criticalSection)); return lastToEnter; }
A Barrier
struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.
When a thread arrives at the barrier (synchronizeBarrier
) it requests the critical section before attempting to increment the atBarrier
variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is invoked. The reason why LeaveCriticalSection
is included outside the the if statement is because SleepConditionVariableCS
will release the critical section before putting the thread to sleep, then reacquire the critical section when it awakes. I don’t like that behavior since its an unnecessary acquisition of the critical section and slows down the implementation.
There is a single allocation routine which performs a couple different rounds of error checking when calling calloc
; first to check if the routine returned null, and second to see if it set a Windows error code that I could inspect from GetLastError
. If either event is true, the application will terminate.
//  // Allocation utility functions //  void* checkedCalloc(size_t count, size_t size) { SetLastError(NO_ERROR); void* result = calloc(count, size); DWORD lastError = GetLastError(); if (result == NULL) { fprintf(stdout, "Failed to allocate %d bytes. GetLastError() = %d.", size, lastError); ExitProcess(EXIT_FAILURE); } if (result != NULL && lastError != NO_ERROR) { fprintf(stdout, "Allocated %d bytes. GetLastError() = %d.", size, lastError); ExitProcess(EXIT_FAILURE); } return result; }
Now on to the core of the implementation. A series of structs are specified for those data that are shared (e.g., points, centroids, etc) among the threads, and those that are local to each thread (e.g., point boundaries, partial results).
//  // Parallel Implementation //  struct LocalAssignData; struct SharedAssignData { Barrier barrier; bool continueLoop; int numPoints; int pointDim; int K; double* points; double* centroids; int* labels; int maxIter; double change; double pChange; DWORD numProcessors; DWORD numThreads; LocalAssignData* local; }; struct LocalAssignData { SharedAssignData* shared; int begin; int end; int* labelCount; double* partialCentroids; };
The assign
method does exactly what was specified in the parallel algorithm section. It will iterate over the portion of points it is responsible for, compute their labels and its partial centroids (sum of points with label , division done at aggregate step.).
void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) { int* local = (int*)checkedCalloc(end  begin, sizeof(int)); int* localCount = (int*)checkedCalloc(K, sizeof(int)); double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double)); // Process a chunk of the array. for (int point = begin; point < end; ++point) { double optDist = INFINITY; int optCentroid = 1; for (int centroid = 0; centroid < K; ++centroid) { double dist = 0.0; for (int dim = 0; dim < pointDim; ++dim) { double d = points[point * pointDim + dim]  centroids[centroid * pointDim + dim]; dist += d * d; } if (dist < optDist) { optDist = dist; optCentroid = centroid; } } local[point  begin] = optCentroid; ++localCount[optCentroid]; for (int dim = 0; dim < pointDim; ++dim) localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim]; } memcpy(&label[begin], local, sizeof(int) * (end  begin)); free(local); memcpy(labelCount, localCount, sizeof(int) * K); free(localCount); memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K); free(localPartial); }
One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using within the assign
routine. This in particular was motivated after I read about false sharing where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU causing overall performance to degrade. For labelCount
and partialCentroids
they’re reallocated since I was concerned about data locality and wanted the three arrays to be relatively in the same neighborhood of memory. Speaking of which, memory coalescing is used for the points array so that point dimensions are adjacent in memory to take advantage of caching. Overall, a series of cache friendly optimizations.
The aggregate
routine follows similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args>shared>local[t].partialCentroids
and args>shared>local[t].labelCount[t]
. Using these partial results all the routine to complete in time which assuming all of these parameters are significantly less than , gives a constant time routine. Once the centroids have been updated, the change in their location is computed and used to determine convergence along with how many iterations have gone by. Here if more than 1,000 iterations have occurred or the relative change in position is less than some tolerance (0.1%) then the threads will terminate.
void aggregate(void * data) { LocalAssignData* args = (LocalAssignData*)data; int* assignmentCounts = (int*)checkedCalloc(args>shared>K, sizeof(int)); double* newCentroids = (double*)checkedCalloc(args>shared>K * args>shared>pointDim, sizeof(double)); // Compute the assignment counts from the work the threads did. for (int t = 0; t < args>shared>numThreads; ++t) for (int k = 0; k < args>shared>K; ++k) assignmentCounts[k] += args>shared>local[t].labelCount[k]; // Compute the location of the new centroids based on the work that the // threads did. for (int t = 0; t < args>shared>numThreads; ++t) for (int k = 0; k < args>shared>K; ++k) for (int dim = 0; dim < args>shared>pointDim; ++dim) newCentroids[k * args>shared>pointDim + dim] += args>shared>local[t].partialCentroids[k * args>shared>pointDim + dim]; for (int k = 0; k < args>shared>K; ++k) for (int dim = 0; dim < args>shared>pointDim; ++dim) newCentroids[k * args>shared>pointDim + dim] /= assignmentCounts[k]; // See by how much did the position of the centroids changed. args>shared>change = 0.0; for (int k = 0; k < args>shared>K; ++k) for (int dim = 0; dim < args>shared>pointDim; ++dim) { double d = args>shared>centroids[k * args>shared>pointDim + dim]  newCentroids[k * args>shared>pointDim + dim]; args>shared>change += d * d; } // Store the new centroid locations into the centroid output. memcpy(args>shared>centroids, newCentroids, sizeof(double) * args>shared>pointDim * args>shared>K); // Decide if the loop should continue or terminate. (max iterations // exceeded, or relative change not exceeded.) args>shared>continueLoop = args>shared>change > 0.001 * args>shared>pChange && (args>shared>maxIter) > 0; args>shared>pChange = args>shared>change; free(assignmentCounts); free(newCentroids); }
Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.
DWORD WINAPI assignThread(LPVOID data) { LocalAssignData* args = (LocalAssignData*)data; while (args>shared>continueLoop) { memset(args>labelCount, 0, sizeof(int) * args>shared>K); // Assign points cluster labels assign(args>shared>labels, args>begin, args>end, args>labelCount, args>shared>K, args>shared>points, args>shared>pointDim, args>shared>centroids, args>partialCentroids); // Tell the last thread to enter here to aggreagate the data within a // critical section synchronizeBarrier(&(args>shared>barrier), aggregate, args); }; return 0; }
The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance will the remaining instances will run on separate threads. The data is partitioned, then the threads are spawned using the CreateThread
routine. I wish there was a Windows API that would allow me to simultaneously create threads with a specified array of arguments because CreateThread will automatically start the thread as soon as it’s created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.
void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) { // Lookup and calculate all the threading related values. SYSTEM_INFO systemInfo; GetSystemInfo(&systemInfo); DWORD numProcessors = systemInfo.dwNumberOfProcessors; DWORD numThreads = numProcessors  1; DWORD pointsPerProcessor = numPoints / numProcessors; // Prepare the shared arguments that will get passed to each thread. SharedAssignData shared; shared.numPoints = numPoints; shared.pointDim = pointDim; shared.K = K; shared.points = points; shared.continueLoop = true; shared.maxIter = 1000; shared.pChange = 0.0; shared.change = 0.0; shared.numThreads = numThreads; shared.numProcessors = numProcessors; initializeBarrier(&(shared.barrier), numProcessors); shared.centroids = centroids; for (int i = 0; i < K; ++i) { int point = rand() % numPoints; for (int dim = 0; dim < pointDim; ++dim) shared.centroids[i * pointDim + dim] = points[point * pointDim + dim]; } shared.labels = (int*)checkedCalloc(numPoints, sizeof(int)); // Create thread workload descriptors LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData)); for (int i = 0; i < numProcessors; ++i) { local[i].shared = &shared; local[i].begin = i * pointsPerProcessor; local[i].end = min((i + 1) * pointsPerProcessor, numPoints); local[i].labelCount = (int*)checkedCalloc(K, sizeof(int)); local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double)); } shared.local = local; // Kick off the threads HANDLE* threads = (HANDLE*)checkedCalloc(numThreads, sizeof(HANDLE)); for (int i = 0; i < numThreads; ++i) threads[i] = CreateThread(0, 0, assignThread, &local[i + 1], 0, NULL); // Do work on this thread so that it's just not sitting here idle while the // other threads are doing work. assignThread(&local[0]); // Clean up WaitForMultipleObjects(numThreads, threads, true, INFINITE); for (int i = 0; i < numThreads; ++i) CloseHandle(threads[i]); free(threads); for (int i = 0; i < numProcessors; ++i) { free(local[i].labelCount); free(local[i].partialCentroids); } free(local); free(shared.labels); deleteBarrier(&(shared.barrier)); }
C#
The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.
In the GPGPU paradigm there is a host and a device in which sequential operations take place on the host (ie. managed C# code) and parallel operations on the device (ie. CUDA code). To delineate between the two, the [Cudafy]
method attribute is used on the static public method assign
. The set of host operations are all within the Fit
routine.
Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and motivated by Thomas Bradley’s Advanced CUDA Optimization workshop notes [pdf] that suggest at that regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) having shared memory that the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.
What is nice about the shared memory is that it is much faster than the global memory of the GPGPU. (cf. Using Shared Memory in CUDA C/C++) To make use of this fact the threads will rely on two arrays in shared memory: sum of the points and the count of those belonging to each centroid. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will then add the arrays in shared memory to those in the global memory using the appropriate atomic operations.
using Cudafy; using Cudafy.Host; using Cudafy.Translator; using Cudafy.Atomics; using System; namespace CUDAfyTesting { public class CUDAfyKMeans { [Cudafy] public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) { // Unpack the const value array int pointDim = constValues[0]; int K = constValues[1]; int numPoints = constValues[2]; // Ensure that the point is within the boundaries of the points // array. int tId = thread.threadIdx.x; int point = thread.blockIdx.x * thread.blockDim.x + tId; if (point >= numPoints) return; // Use two shared arrays since they are much faster than global // memory. The shared arrays will be scoped to the block that this // thread belongs to. // Accumulate the each point's dimension assigned to the k'th // centroid. When K = 128 => pointDim = 2; when pointDim = 128 // => K = 2; Thus max(len(sharedSums)) = 256. float[] sharedSums = thread.AllocateShared<float>("sums", 256); if (tId < K * pointDim) sharedSums[tId] = 0.0f; // Keep track of how many times the k'th centroid has been assigned // to a point. max(K) = 128 int[] sharedCounts = thread.AllocateShared<int>("counts", 128); if (tId < K) sharedCounts[tId] = 0; // Make sure all threads share the same shared state before doing // any calculations. thread.SyncThreads(); // Find the optCentroid for point. double optDist = double.PositiveInfinity; int optCentroid = 1; for (int centroid = 0; centroid < K; ++centroid) { double dist = 0.0; for (int dim = 0; dim < pointDim; ++dim) { double d = centroids[centroid * pointDim + dim]  points[point * pointDim + dim]; dist += d * d; } if (dist < optDist) { optDist = dist; optCentroid = centroid; } } // Add the point to the optCentroid sum for (int dim = 0; dim < pointDim; ++dim) // CUDA doesn't support double precision atomicAdd so cast down // to float... thread.atomicAdd(ref(sharedSums[optCentroid * pointDim + dim]), (float)points[point * pointDim + dim]); // Increment the optCentroid count thread.atomicAdd(ref(sharedCounts[optCentroid]), +1); // Wait for all of the threads to complete populating the shared // memory before storing the results back to global memory where // the host can access the results. thread.SyncThreads(); // Have to do a lock on both of these since some other Streaming // Multiprocessor could be running and attempting to update the // values at the same time. // Copy the shared sums to the output sums if (tId == 0) for (int i = 0; i < K * pointDim; ++i) thread.atomicAdd(ref(outputSums[i]), sharedSums[i]); // Copy the shared counts to the output counts if (tId == 0) for (int i = 0; i < K; i++) thread.atomicAdd(ref(outputCounts[i]), sharedCounts[i]); }
Before going on to the Fit
method, let’s look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDAenabled GPGPU. Within the CUDAfy.Translator
namespace there are a handful of classes for decompiling the application into an abstract syntax tree using ICharpCode.Decompiler
and Mono.Cecil
, then converting the AST over to CUDA C via visitor pattern, next compiling the resulting CUDA C using NVIDIA’s NVCC compiler, and finally the compilation result is relayed back to the caller if there’s a problem; otherwise, a CudafyModule
instance is returned, and the compiled CUDA C code it represents loaded up on the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy
, CudaLanguage.RunTransformsAndGenerateCode
, CUDAAstBuilder.GenerateCode
, CUDAOutputVisitor
and CudafyModule.Compile
.)
private CudafyModule cudafyModule; private GPGPU gpgpu; private GPGPUProperties properties; public int PointDim { get; private set; } public double[] Centroids { get; private set; } public CUDAfyKMeans() { cudafyModule = CudafyTranslator.Cudafy(); gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId); properties = gpgpu.GetDeviceProperties(true); gpgpu.LoadModule(cudafyModule); }
The Fit
method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory essentially copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points arrays since its the largest resource, but did not see noticeable gains so I left it as is.
At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set
method, then the CUDA code is invoked using the Launch
routine with the specified block and grid dimensions and device pointers. One thing that the API does is return an array when you allocate or copy memory over to the device. Personally, an IntPtr
seems more appropriate. Execution of the routine is very quick, where on some of my tests it took 1 to 4 ms to process 100,000 two dimensional points. Once the routine returns, memory from the device (sum and counts) is copied back over to the host which then does a quick operation to derive the new centroid locations and copy that memory over to the device for the next iteration.
public void Fit(double[] points, int pointDim, int K) { if (K <= 0) throw new ArgumentOutOfRangeException("K", "Must be greater than zero."); if (pointDim <= 0) throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero."); if (points.Length < pointDim) throw new ArgumentOutOfRangeException("points", "Must have atleast pointDim entries."); if (points.Length % pointDim != 0) throw new ArgumentException("points.Length must be n * pointDim > 0."); int numPoints = points.Length / pointDim; // Figure out the partitioning of the data. int threadsPerBlock = properties.MaxThreadsPerBlock / 2; int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0); dim3 blockSize = new dim3(threadsPerBlock, 1, 1); dim3 gridSize = new dim3( Math.Min(properties.MaxGridSize.x, numBlocks), Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)), 1 ); int[] constValues = new int[] { pointDim, K, numPoints }; float[] assignmentSums = new float[pointDim * K]; int[] assignmentCount = new int[K]; // Initial centroid locations picked at random Random prng = new Random(); double[] centroids = new double[K * pointDim]; for (int centroid = 0; centroid < K; centroid++) { int point = prng.Next(points.Length / pointDim); for (int dim = 0; dim < pointDim; dim++) centroids[centroid * pointDim + dim] = points[point * pointDim + dim]; } // These arrays are only read from on the GPU they are never written // on the GPU. int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues); double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids); double[] devicePoints = gpgpu.CopyToDevice<double>(points); // These arrays are written written to on the GPU. float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums); int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount); // Set up main loop so that no more than maxIter iterations take // place, and that a realative change less than 1% in centroid // positions will terminate the loop. int maxIter = 1000; double change = 0.0, pChange = 0.0; do { pChange = change; // Clear out the assignments, and assignment counts on the GPU. gpgpu.Set(deviceSums); gpgpu.Set(deviceCount); // Lauch the GPU portion gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount); // Copy the results memory from the GPU over to the CPU. gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums); gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount); // Compute the new centroid locations. double[] newCentroids = new double[centroids.Length]; for (int centroid = 0; centroid < K; ++centroid) for (int dim = 0; dim < pointDim; ++dim) newCentroids[centroid * pointDim + dim] = assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid]; // Calculate how much the centroids have changed to decide // whether or not to terminate the loop. change = 0.0; for (int centroid = 0; centroid < K; ++centroid) for (int dim = 0; dim < pointDim; ++dim) { double d = newCentroids[centroid * pointDim + dim]  centroids[centroid * pointDim + dim]; change += d * d; } // Update centroid locations on CPU & GPU Array.Copy(newCentroids, centroids, newCentroids.Length); deviceCentroids = gpgpu.CopyToDevice<double>(centroids); } while (change > 0.01 * pChange && maxIter > 0); gpgpu.FreeAll(); this.Centroids = centroids; this.PointDim = pointDim; } } }
Python
I include the Python implementation for the sake of demonstrating how scikitlearn was invoked throughout the following experiments section.
model = KMeans( n_clusters = numClusters, init='random', n_init = 1, max_iter = 1000, tol = 1e3, precompute_distances = False, verbose = 0, copy_x = False, n_jobs = numThreads ); model.fit(X); // X = (numPoints, pointDim) numpy array.
Experimental Setup
All experiments where conducted on a laptop with an Intel Core i72630QM Processor and NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. C++ and C# implementations were developed and compiled by Microsoft Visual Studio Express 2013 for Desktop targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). Python implementation was developed and compiled using Eclipse Luna 4.4.1 targeting Python 2.7, scikitlearn 0.16.0, and numpy 1.9.1. All compilers use default arguments and no extra optimization flags.
For each test, each reported test point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is computed as the (wall) time taken to execute model.fit(points, pointDim, numClusters)
where time is measured by: QueryPerformanceCounter
in C++, System.Diagnostics.Stopwatch
in C#, and time.clock
in Python. Every test is based on a dataset having two natural clusters at .25 or .25 in each dimension.
Results
Varying point quantity
Both the C++ and C# sequential and parallel implementations outperform the Python scikitlearn implementations. However, the C++ sequential and parallel implementations outperforms their C# counterparts. Though the C++ sequential and parallel implementations are tied, as it seems the overhead associated with multithreading overrides any multithreaded performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but does outperform the C# sequential one as the number of points to cluster increases.
So what’s the deal with Python scikitlearn? Why is the parallel version so slow? Well, it turns out I misunderstood the nJobs parameter. I interpreted this to mean that process of clustering a single set of points would be done in parallel; however, it actually means that the number of simultaneous runs of the whole process will occur in parallel. I was tipped off to this when I noticed multiple python.exe fork processes being spun off which surprised me that someone would implement a parallel routine that way leading to a more thorough reading the scikitlearn documentation. There is parallelism going on with scikitlearn, just not the desired type. Taking that into account the linear one performs reasonably well for being a dynamically typed interpreted language.
Varying point dimension
The C++ and C# parallel implementations exhibit consistent improved run time over their sequential counterparts. In all cases the performance is better than scikitlearn’s. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why do we not better CUDAfy.NET performance? The performance we see is identical to the vary point quantity test. So on one hand it’s nice that increasing the point dimensions did not dramatically increase the run time, but ideally, the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device which is a relatively slow process. Since I’m short on time, this will have to be something I investigate in more detail in the future.
Varying cluster quantity
As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikitlearn implementation starts to show some competitive performance. The exciting news of course is that varying the cluster size finally reveals improved C# CUDAfy.NET run time. Now there is some curious behavior at the beginning of each plot. We get performance for two clusters, then jump up into about for four to eight clusters. Number of points and their dimension are held constant, but we allocate a few extra double’s for the cluster centroids. I believe this has to do with cache behavior. I’m assuming for fewer than four clusters everything that’s needed sits nicely in the fast L1 cache, and moving up to four and more clusters requires more exchanging of data between L1, L2, L3, and (slower) memory memory to the different cores of the Intel Core i72630QM processor I’m using. As before, I’ll need to do some more tests to verify that this is what is truly happening.
Language comparison
For the three tests considered, the C++ implementations gave the best run time performance on point quantity and point dimension tests while the C# CUDAfy.NET implementation gave the best performance on the cluster quantity test.
The C++ implementation could be made to run faster be preallocating memory in the same fashion that C# does. In C# when an application is first created a block of memory is allocated for the managed heap. As a result, allocation of reference types in C# is done by incrementing a pointer instead of doing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before executing the C# routines, while the same allocation takes place during the C++ routines. Hence, the C++ run times will have an overhead not present in the C# run times. Had I implemented memory allocation in C++ the same as it’s done in C#, then the C++ implementation would be undoubtedly even faster than the C# ones.
While using scikitlearn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance; frequently coming ten times slower than the other two implementations on the varying point quantity and dimension tests, but within tolerance on the vary cluster quantity tests.
Future Work
The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.
I’d like to spend more time figuring out some of the highperformance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than a week or two I spent on this. In addition, I’d like to dig a little deeper into doing CUDA C directly rather than through the convenient CUDAfy.NET wrapper; as well as explore OpenMP and OpenCL to see how they compare from a development and performanceoriented view to CUDA.
Python and scikitlearn were used a baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter since Julia pitches itself as a highperformance solution, and is used for exploratory data analysis and prototyping machine learning systems.
While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I’d like to apply CUDAfy.NET to the expectation maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of kmeans clustering, and it will be good to implement the more involved EM algorithm.
Conclusions
Through this exercise, we can expect to see modest speedups over sequential implementations of about 2.62x and 11.69x in the C# parallel and GPGPU implementations respectively when attempting to find large numbers of clusters on low dimensional data. Fortunately the way you use kmeans clustering is to find the cluster quantity that maximizes the Bayesian information criterion or Akaike information criterion which means running the vary centroid quantity test on real data. On the other hand, most machine learning data is of a high dimension so further testing (on a real data set) would be needed to verify it’s effectiveness in a production environment. Nonetheless, we’ve seen how parallel and GPGPU based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.
Bibliography
[LiFa89] Li Xiaobo and Fang Zhixi, “Parallel clustering algorithms”, Parallel Computing, 1989, 11(3): pp.275290.
[MaMi09] Mario Zechner, Michael Granitzer. “Accelerating KMeans on the Graphics Processor via CUDA.” First International Conference on Intensive Applications and Services, INTENSIVE’09. pp. 715, 2009.
[Stu82] Stuart P. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28:129137, 1982.
Notes from SIGGRAPH 2015
Introduction
I recently flew out to Los Angeles to attend the 42nd International Conference and Exhibition on Computer Graphics and Interactive Techniques. SIGGRAPH‘s theme this year was the crossroads of discovery bringing it closer to its roots that began here in Boulder, Colorado back in 1974. For me it was a chance to dig a little deeper into Computer Graphics research following my recent studies and develop a better understanding of the industries pushing the domain forward. As with most posts on this site, this is a short reminder to myself, and hopefully gives others an idea of what they could expect if they went.
Production Sessions
Disney – Pixar’s “Lava”: Moving Mountains was an informative production session detailing the process of bringing “Lava” to the screen. “Lava” is the story of Uku, a lonely volcano in search of love. As millions of years go by, he begins to lose hope as he recedes back into the ocean. But all is not lost. Uku finds renewed hope for love as newly formed volcano Lele rises to the surface. After the Pixar magicians reveal their secrets, technical details, and engrossing backstory, “Lava” becomes an even more enjoyable short film.
The presentation began with director James Murphy explaining his personal story inspiring the short before giving a live performance of the titular song. Colin Levy followed Murphy’s conceptualization, story boarding, and clay mockups with how the film would be framed for maximal emotional impact. Levy explain the exploratory process of filming the opening scene of the film to find the right combination of lenses, and flight paths based on realworld references to help illustrate the size and scale of Uku, the hopeless volcano.
Both Aaron Hartline and Austin Lee continued discussing the challenges of animating and rigging Uku, Lele, a pair of dolphins, birds, whales, and turtles (the last four representing young love, newly weds, established lives, and life long love). In particular, the different approaches for animating and rigging the facial features of Uku (eyelids, lips, checks, and so on) and how the teams iterated to find a balance between what the audience might expect from an anthropomorphic mountain and what they wanted to achieve as story tellers.
Perhaps the most interesting moment in the presentation was Dirk Van Gelder’s sneak peak of the enhancements the team made to Presto (Pixar’s inhouse animation tool) to provide animators final render quality realtime feedback of their changes through a clever combination of Rendermanbased final renders and OpenGL hardware texturing. Aside from the technical novelty, it’s a great example of time saving enhancements that make it easier for people to freely experiment and explore different approaches leading to better results.
The closing discussion by Byron Bashforth and Farhez Rayani on shading and lighting was informative and it was interesting to see how the procedural approaches were done to give Uku both a physically realistic and visually appealing biome consisting of different shaders, and static and procedural assets. Overall, a very interesting peak into the workflow of one of the most venerable studios in the industry.
Birds of a Feather
Having worked in the healthcare space for a fair bit of time, I was attracted to meetings on Volume Rendering and Medical Visualization and HealthTech: Modeling, Interaction, Hardware, and Analysis to see what people have been working on and to get a glimpse of where things are heading.
Nicholas Polys of Virginia Tech and Michael Aratow (MD) (both chairs of the Web3D Consortium Medical Working Group) began the medical visualization discussion by going over common libraries such as VTK (The Visualization Toolkit) and Voreen (Volume Rendering Engine), before discussing general purpose analysis and visualization tools such as Paraview. Volume oriented applications such as Seg3D (volume segmentation tool), OsiriX (DICOM viewer) were covered and finally, tools for exploring biomolecular systems such as Chimera, VMD (Visual Molecular Dynamics) and PathSim (EpsteinBarr Virus exploration) were discussed giving the audience a good lay of the land. Brief bit of time was given to surgical training tools based on 3D technologies and haptic feedback (e.g. H3D).
These were all interesting applications and seeing how they all work using different types of humanmachine interfaces (standard workstations, within CAVE environments, or even in virtual reality headsets and gloves) was eye opening. The second main theme of the discussion was on standardization when it comes to interoperability and reproducibility. There was a heavy push for X3D along with interoperability with DICOM. Like a lot of massive standards, DICOM has some wiggle room in it that leads to inconsistent implementations from vendors. That makes portability of data between disparate systems complicated (not to mention DICOM incorporates nongraphical metadata such as complex HL7). Suffice to say X3D is biting off a big chunk of work, and I think it will take some time for them to make progress in healthcare since it’s a fragmented industry that is not in the least bit technologically progressive.
One area I felt was absent during the discussion was how 3D graphics could be used to benefit everyday patients. There is a wealth of fMRI and ECoG data that patients could benefit from seeing in an accessible way for example showing a patient a healthy baseline, then accentuating parts of their own data and explaining how those anomalies affect their wellbeing. If a component can be developed to deliver that functionality, then it can be incorporated into a patient portal alongside all other charts and information that providers have accumulated for the patient.
The HealthTech discussion was presented by Ramesh Raskar, and his graduate students and postdocs from the MIT Media Lab. They presented a number of lowcost, lowpower diagnostic devices for retinal imaging and electroretinography, highspeed tomography, cellphonebased microscopy, skin perfusion photography, and dental imaging. Along with more social oriented technologies for identify safe streets to travel, and automatically discerning mental health from portraits. There were plenty of interesting applications being developed by the group, but it was more of a show and tell by the group than discussing the types of challenges beyond the scope of the work by MIT Media Lab (as impressive as they are). (For example, The fine work 3Shape A/S has done with fast scanning of teeth for digital dentistry.)
One thing that was discussed of key interest was Meddit a way for medical practitioners and researchers to define open problems to maturity, then presenting those challenges to computer scientists to work on and develop solutions. While the company name is uninspired, I think this is the right kind of collaboration platform for the “toolmaker” view of hardware engineers, computer scientists and software engineers as it identifies a real issue, presents an opportunity, and gives a pool of talented, bright people a way to make a difference. I am skeptical that it will take off (I think it would have more success as a niche community within an umbrella collaboration platform i.e. Stack Exchange model), but the idea is sound and something people should get excited about.
RealTime Live!
The challenge of realtime graphics is very appealing to me and getting to see what different software studios are working on was a real treat. While there were several presentations and awards given during the two hour long event, three demos stood out to me. Balloon Burst given by Miles Macklin of NVIDIA, BabyX presented by Mark Sagar of University of Auckland, and award winner A Boy and His Kite demoed by Nick Penwarden of Epic Games.
Macklin’s demo was impressive in that it simulated more than 750,000 particles (250,000 by their solver Flex, and 512,000 for mist and droplets) and their paper [pdf] Fast GridFree Surface Tracking gave some technical background into how they achieved their results. Fluid simulation is something I’d like to spend some time exploring, obviously won’t be able to create something as technical as Macklin’s group, but would like to spend some time on SmoothedParticle Hydrodynamics, and seeing NVIDIA’s work was a good motivation boost to explore the subject further on my own.
Perhaps the most unexpected entry in the series was Sagar’s BabyX. It was a fascinating assemblage of neural networks, real time graphics, natural language processing, computer vision, and image processing to create the ultimate “Sims” like character a baby that could learn and invoke different emotional responses based on external stimuli. Realtime graphics were photorealistic, and seeing the modeling behind the system to emulate how the brain behaves in the presence of different dopamine levels (and how those levels correspond to things like Parkinson’s and schizophrenia) was impressive as well. Overall, a fantastic technical achievement and I look forward to following Sagar’s work as it continues to evolve.
My main interest in going to RealTime Live! was to see Penwarden’s work on A Boy and His Kite. This impressive demo spanning hundred square miles inspired by the Isle of Skye really puts to shame my prior work in creating procedural environments. Nonetheless, it goes to show to far the medium can be pushed and how small the divide between realtime and film is becoming. Computer Graphics World published (JulyAugust 2015) a very thorough technical overview [p. 4048] of how Penwarden’s team produced the short, in addition to the features added to Unreal Engine 4 to make the demo shine.
Wrapup
There were many other things I explored that I won’t go into detail namely the VR Village, Emerging Technologies, Research Posters, Exhibition, and Job Fair. I’m still quite skeptical that virtual reality (and to the same extent augmented reality) technologies will come into the mainstream; I think they’ll continue to be the subject of researchers, gaming enthusiasts, and industry solutions for automotive, and healthcare problems. One thing that was a bit of a disappointment was the Job Fair as there were barely any companies participating. Overall, a positive experience learning what other people are doing in the industry, and getting to see how research is being applied in a variety of different domains including automotive, entertainment, engineering, healthcare, and science.