GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA
C, CUDA, and Python source code available on GitHub
Introduction
Gaussian Mixture Models [1, 435–439] offer a simple way to capture complex densities by employing a linear combination of $K$ multivariate normal distributions, each with their own mean, covariance, and mixture coefficient, $\pi_k$, s.t. $\sum_{k=1}^{K} \pi_k = 1$, $0 \le \pi_k \le 1$.
Of practical interest is the learning of the number of components and the values of the parameters. Evaluation criteria, such as the Akaike and Bayesian information criteria, can be used to identify the number of components, or nonparametric models like Dirichlet processes can be used to avoid the matter altogether. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU for varying numbers of components, points, and point dimensions.
Multivariate Normal Distribution
The multivariate normal distribution with mean, $\mu \in \mathbb{R}^d$, and symmetric, positive definite covariance, $\Sigma \in \mathbb{R}^{d \times d}$, is given by:

$\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2} \lvert \Sigma \rvert^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
From a computational perspective, we will be interested in evaluating the density for $n$ values. A naive implementation would therefore be bounded by $O(n d^3)$ due to the matrix determinant and inverse in the normalization term. We can improve upon this by computing the Cholesky factorization, $\Sigma = L L^T$, where $L$ is a lower triangular matrix [6, 157–158]. The factorization requires $O(d^3)$ time, and computing the determinant becomes $O(d)$ by taking advantage of the fact that $\lvert \Sigma \rvert = \prod_{i=1}^{d} L_{ii}^2$. Further, we can precompute the factorization and normalization factor for a given parameterization, which leaves us with the complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires two vector-matrix operations and the inverse of the covariance matrix, with worst case behavior $O(d^3)$. Leveraging the Cholesky factorization, we’ll instead solve triangular systems by forward and backward substitution in $O(d^2)$ and complete an inner product in $O(d)$: solving $L z = x - \mu$ gives $(x - \mu)^T \Sigma^{-1} (x - \mu) = z^T z$. Thus, our preinitialization time is $O(d^3)$ and each density evaluation is $O(d^2)$. Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic, $\sigma^2 I$, and non-isotropic, $\operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, configurations. For robustness, we’ll stick with the full covariance.
To avoid numerical issues such as overflow and underflow, we’re going to work with $\log \mathcal{N}(x \mid \mu, \Sigma)$ throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criterion satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a proper subspace of $\mathbb{R}^d$.
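The article’s implementations are in C and CUDA; as a minimal pure-Python sketch of the Cholesky-based log-density evaluation described above (the function and variable names are my own), one might write:

```python
import math

def cholesky(a):
    """Lower-triangular L with a = L L^T (a must be symmetric positive definite)."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def log_mvn_pdf(x, mu, sigma):
    """log N(x | mu, sigma) via the Cholesky factorization sigma = L L^T."""
    d = len(x)
    L = cholesky(sigma)  # O(d^3); reusable across evaluations of the same component
    # Forward-substitute L z = (x - mu); the Mahalanobis term is then z . z
    z = [0.0] * d
    for i in range(d):
        z[i] = (x[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))  # |sigma| = prod L_ii^2
    maha = sum(v * v for v in z)
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)
```

Factoring once and evaluating many points amortizes the $O(d^3)$ cost, matching the $O(d^3 + n d^2)$ analysis above.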
Expectation Maximization
From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by $k^* = \arg\max_k \, \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updates those parameters by maximizing the log-likelihood of the data:

$\log p(X) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:

$\log \sum_{i} e^{x_i} = a + \log \sum_{i} e^{x_i - a}$

where the term $a = \max_i x_i$ is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of $\mu_k$ and $\Sigma_k$, and account for it explicitly in the update of $\pi_k$):

$\gamma_{nk} = \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
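The log-sum-exp trick is short enough to state directly; a Python sketch (using the maximum as the pivot, as described above):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum_i exp(x_i)) using the max as the pivot."""
    a = max(xs)
    if math.isinf(a):        # all arguments -inf: the sum is zero, so the log is -inf
        return a
    return a + math.log(sum(math.exp(x - a) for x in xs))
```

A naive `math.log(sum(math.exp(x) for x in xs))` overflows for arguments around 1000; the pivoted version does not.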
The new parameters are resolved within the maximization step, where $\gamma_{nk}$ denotes the posterior responsibility of component $k$ for point $n$:

$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk} \qquad \mu_k = \frac{\sum_{n} \gamma_{nk} \, x_n}{\sum_{n} \gamma_{nk}} \qquad \Sigma_k = \frac{\sum_{n} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n} \gamma_{nk}}$
The algorithm continues back and forth between expectation and maximization stages until the change in log-likelihood is less than some epsilon, or a maximum number of user-specified iterations has elapsed.
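To make the E/M loop concrete, here is a deliberately simplified 1-D, plain-probability sketch (not the article’s log-space C/CUDA implementation; the function name and the range-based mean initialization are my own, standing in for the k-means initialization described above):

```python
import math

def em_gmm_1d(xs, k=2, iters=50, eps=1e-9):
    """Minimal 1-D EM: uniform mixing weights, means spread over the data
    range, unit variances to start; alternate E and M steps."""
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E step: responsibilities gamma[i][j] proportional to pi_j N(x_i | mu_j, var_j)
        gamma = []
        for x in xs:
            w = [pi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            z = sum(w) or eps
            gamma.append([wj / z for wj in w])
        # M step: re-estimate weights, means, variances from the soft counts
        for j in range(k):
            nj = sum(g[j] for g in gamma) or eps
            pi[j] = nj / len(xs)
            mu[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / nj
            var[j] = max(sum(g[j] * (x - mu[j]) ** 2
                             for g, x in zip(gamma, xs)) / nj, eps)
    return pi, mu, var
```

On two well-separated clumps the means converge to the clump centers within a few iterations; the variance floor `eps` guards against collapse onto a single point.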
Implementations
Sequential Per-iteration complexity is given by $O(N K d^2 + K d^3)$, the first term for the density evaluations and posteriors, the second for re-factorizing the covariances. We expect $d \ll N$ and $K \ll N$ because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by $O(N K d^2)$.
Parallel There are two natural data parallelisms that appear in the algorithm: the calculation of the log-likelihood and the posteriors parallelize across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. The resulting complexity is given by $O(N K d^2 / P)$ for work coordinated across $P$ processors.
Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of the component densities we compute each point in parallel by forming a grid of one dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the log-likelihood is done by computing and storing the per-point component densities, then updating that storage with the log-sum-exp of each point’s densities, and then performing a parallel reduction over the points to produce the log-likelihood. Parallel reductions are a core task and are implemented by first padding the input array of points up to the nearest power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the log-likelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute the posteriors, the densities are copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component to yield the posteriors. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances are computed with a single kernel invocation parallel in the number of components.
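As a CPU-side sketch of the reduction scheme just described (pad to the next power of two, then halve the active region each step; on the GPU the inner loop’s iterations run as parallel threads against shared memory), assuming a generic associative operator with an identity element:

```python
def tree_reduce(xs, op, identity):
    """Simulate a GPU-style tree reduction: pad to the next power of two with
    the identity element, then halve the active region until one value remains."""
    n = 1
    while n < len(xs):
        n *= 2
    buf = list(xs) + [identity] * (n - len(xs))
    stride = n // 2
    while stride > 0:
        for i in range(stride):  # on the GPU these iterations are parallel threads
            buf[i] = op(buf[i], buf[i + stride])
        stride //= 2
    return buf[0]
```

The same skeleton serves both the sum reduction (identity 0) for the log-likelihood and the max reduction (identity negative infinity) used for the log-sum-exp pivot.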
The main design consideration was whether or not to use streams. For larger numbers of components, streams result in improved runtime performance; however, they come at the cost of increased memory usage, which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increased memory usage was deemed an acceptable trade-off against using less memory and executing each component sequentially.
To optimize the runtime of the implementation, nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device, resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, the implementation could execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events for parallelization of the component probability densities and parameter updates in the maximization step. In doing so, this allowed for a $K$-fold reduction since the component calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory, minimizing the number of global memory writes, instead of performing iterated reductions against sequential addressing that performed global memory reads and writes for each point. The final optimization step was to use pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.
Evaluation
To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use the Box-Muller transform, Ziggurat method, or ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can be done by sampling a standard normal vector $z$ and computing $x = \mu + A z$, where $A$ can be computed by eigendecomposition or Cholesky factorization, $\Sigma = L L^T$ with $A = L$. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient, then sample the underlying distribution, and perform this process for the desired number of points.
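The sampling step can be sketched in a few lines of Python. Note the article uses the ratio-of-uniforms method for the standard normal draws; the Box-Muller transform is substituted here purely because it is shorter to state, and the function names are my own:

```python
import math, random

def box_muller():
    """One standard normal draw via the Box-Muller transform (stand-in for the
    ratio-of-uniforms method used in the article)."""
    u1, u2 = random.random(), random.random()
    return math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)

def sample_mvn(mu, L):
    """x = mu + L z with z standard normal and sigma = L L^T (Cholesky factor)."""
    d = len(mu)
    z = [box_muller() for _ in range(d)]
    # L is lower triangular, so only the first i+1 terms of row i contribute
    return [mu[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
```

Because $L$ is triangular, each draw costs $O(d^2)$ rather than the $O(d^2)$-plus-factorization cost of an eigendecomposition-based transform.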
The matter of generating GMMs is more interesting. Here we draw the mixing coefficients uniformly at random and normalize them so that $\sum_k \pi_k = 1$; alternatively, one could draw them from a Dirichlet distribution. Means are drawn by $\mu_k \sim \mathcal{N}(0, \sigma^2 I)$ with $\sigma$ large enough that the means are relatively spread out in $\mathbb{R}^d$. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution, $W(\Sigma, n)$ for $n > d$, comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of $n$ vectors. Based on a method by [8], [9] gives an equally efficient method for sampling: letting $\Sigma = L L^T$, the sample is $(L T)(L T)^T$, where $T$ is lower triangular with $T_{ii}^2 \sim \chi^2_{n-i+1}$ and $T_{ij} \sim \mathcal{N}(0, 1)$ for $i > j$.
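A Python sketch of this covariance sampler (my own naming; a chi-squared draw with $k$ degrees of freedom is realized as a Gamma($k/2$, scale 2) draw via the standard library):

```python
import math, random

def sample_wishart(scale_chol, dof):
    """Draw from W(scale, dof) via the Bartlett-style decomposition: with
    sigma = L L^T, the sample is (L T)(L T)^T, where T is lower triangular,
    T_ii^2 ~ chi^2_{dof - i} (0-based i), and the strictly-lower entries are
    standard normal."""
    d = len(scale_chol)
    T = [[0.0] * d for _ in range(d)]
    for i in range(d):
        # chi^2_k draw as Gamma(k/2, scale=2)
        T[i][i] = math.sqrt(random.gammavariate((dof - i) / 2.0, 2.0))
        for j in range(i):
            T[i][j] = random.gauss(0.0, 1.0)
    # M = L T, then return M M^T
    M = [[sum(scale_chol[i][k] * T[k][j] for k in range(d)) for j in range(d)]
         for i in range(d)]
    return [[sum(M[i][k] * M[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]
```

The result is symmetric positive definite by construction, and its expectation is the scale matrix times the degrees of freedom.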
To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured while varying each of the number of points, components, and dimensions, holding the other two fixed. From an end user perspective, wall clock time is preferable to the time the operating system actually devoted to the problem, since wall clock time is what the user experiences. There will be variability in the results since each instance requires a different number of iterations for the log-likelihood to converge. Tests were conducted on a Xeon E3-1245 v5 3.5 GHz system with 32 GB of memory and an NVIDIA GTX 1060 6GB graphics card with 1280 cores.
Since the parameter space is relatively large, Figures 2–5 look at varying one parameter while fixing the others to demonstrate the relative merits of each approach. When the number of points dominates, the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the number of components is high, giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable, delivering superior runtime performance without sacrificing quality.
It is important to note that the results obtained from the CUDA solution may differ from those of the sequential and parallel approaches. This is due to nondeterministic round off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably, the presence of fused multiply-add on NVIDIA GPUs, which is more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:
Conclusion
This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation-Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying one of the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instances, and focus on providing higher level language bindings so that it can be better utilized in traditional data science toolchains.
References
 Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
 Collange, S., Defour, D., Graillat, S., and Iakymchuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83–97.
 Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) (1977), 1–38.
 Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19 (2016), C1–C32.
 Harris, M. Optimizing CUDA. SC07: High Performance Computing with CUDA (2007).
 Kincaid, D., and Cheney, W. Numerical analysis: Mathematics of scientific computing, 3rd ed. Brooks/Cole, 2002.
 Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257–260.
 Odell, P., and Feiveson, A. A numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199–203.
 Sawyer, S. Wishart distributions and inverse-Wishart sampling. URL: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf (2007).
 Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. NVIDIA Technical White Paper (2011).
A Greedy Approximation Algorithm for the Linear Assignment Problem
Starting today, I will be posting some of the related source code for articles on GitHub.
Introduction
The Linear Assignment Problem (LAP) is concerned with uniquely matching an equal number, $n$, of workers to tasks such that the overall cost of the pairings is minimized. A polynomial time algorithm, the Hungarian method, was developed in the late fifties by [6], and further refined by [9]. It is named after the Hungarian mathematicians Kőnig and Egerváry, whose theorems in the 1930s form the basis for the method. While the Hungarian method can solve LAP instances in $O(n^3)$ time, we wish to find faster algorithms even if it means sacrificing optimality in the process. Here we examine a greedy $O(n^2 \log n)$ approximation algorithm in terms of its approximation factor and compare it empirically to the Hungarian method.
Linear Assignment Problem
The above linear program has cost, $C$, and assignment, $X$, matrices that specify the terms of the LAP. This is equivalent to finding a perfect matching in a weighted bipartite graph. A minimal cost may have several possible assignments, but we are only interested in finding just one. It is assumed that no one worker can do all jobs more efficiently by themselves than distributing the work across all workers. Likewise, if the costs are thought of as durations, then the minimum cost is the minimum sequential, rather than parallel, time taken to complete the tasks.
From a practical point of view, we may relax the integral constraint on the costs and allow all positive real-valued costs. For instances where there are more jobs than workers, or vice versa, dummy entries valued greater than the existing maximum may be added. Minimizing the cost is the default objective, but the maximum cost can be found by finding the optimal assignment for the matrix whose entries are the maximum entry minus each cost, then evaluating that assignment’s cost relative to the original matrix.
Algorithms
Brute Force Rather than using the mathematical programming or graph theoretic representation of the problem, we can instead view the problem as finding the assignment that minimizes the cost out of all possible assignments:
There are $n!$ such assignments, which can be produced using an iterative version of Heap’s algorithm [5] in $O(n!)$ time, assuming one does differential scoring (as opposed to calculating the score for each permutation from scratch, which would result in an $O(n \cdot n!)$ algorithm).
Random The random algorithm selects a permutation uniformly from the set of all $n!$ possible assignment permutations in $O(n)$ time using the Fisher-Yates shuffle [4]. This obviously does not produce an optimal or near-optimal solution, but serves as a straw man to compare other results.
Greedy The greedy heuristic repeatedly covers the row and column of the smallest uncovered entry in the cost matrix until all entries are covered. The resulting set of selected entries then constitutes the assignment of workers to jobs. An inefficient $O(n^3)$ algorithm can be used to find the smallest entry every iteration, or a more efficient $O(n^2 \log n)$ result can be obtained through the use of a sorted, array indexed hybrid mesh and queue. Let a node represent a tuple consisting of row, column, and value; the previous and next entries in the matrix by value; and the (left, above, right, below) nodes that are adjacent to it.
Algorithm 1 A greedy algorithm for the LAP.

 // Adjacent node left, above, right, below properties
 // Sort in ascending order by node value
 // Adjacent node previous and next properties

 // Deletes row and col of
Allocating and linking the nodes for the assignment is $O(n)$; the mesh $O(n^2)$; the queue $O(n^2 \log n)$ on account of sorting. Therefore, initialization requires $O(n^2 \log n)$ time. The body of the loop requires a constant time assignment of worker to job, and $O(n)$ time to remove the row and column of the selected entry using a modified depth first search. Thus, the loop itself accounts for $O(n^2)$ time. The resulting time complexity is therefore $O(n^2 \log n)$.
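For reference, here is a Python sketch of the greedy heuristic using the simple scan ($O(n^3)$) rather than the mesh-and-queue structure; it produces the same assignments, just more slowly, and the function name is my own:

```python
def greedy_lap(cost):
    """Repeatedly take the smallest uncovered entry, assign that worker to
    that job, and cover its row and column (simple O(n^3) scan variant)."""
    n = len(cost)
    free_rows, free_cols = set(range(n)), set(range(n))
    assignment = {}
    while free_rows:
        r, c = min(((i, j) for i in free_rows for j in free_cols),
                   key=lambda rc: cost[rc[0]][rc[1]])
        assignment[r] = c
        free_rows.discard(r)
        free_cols.discard(c)
    return assignment, sum(cost[r][c] for r, c in assignment.items())
```

For the matrix `[[1, 4, 5], [3, 2, 9], [8, 7, 6]]`, the scan picks 1, then 2, then 6, giving a total cost of 9.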
Breaking ties for the minimum uncovered value can result in different costs. This drawback is shown in the above example, where choosing one of the tied minimum entries yields a lower final cost than choosing the other. The next progression in the design of the greedy algorithm would be to try all minimum positions and keep the top performing paths.
Hungarian The general idea behind the Kuhn-Munkres algorithm is that if we are given an initial assignment, we can make further assignments and potentially reassign workers until all workers have been tasked with a job. The high-level sketch of the algorithm starts with an initial assignment. While we have jobs that are unassigned, we look for qualified workers, i.e., the zero entries. If a worker is already assigned to a job, but is also qualified for another, then we prime the alternative and continue to the next qualified worker, but if that is the only job the worker is qualified for, then we’d like to reassign any other worker already tasked to that job. This leads to a natural ripple effect represented by an alternating path of starred and primed entries. In Munkres’ paper [9] “starred” zeros represent assignments of workers to jobs, and “primed” zeros are alternative assignments. By flipping the bits of the path, we reassign workers to their alternative tasks while ensuring the assignment continues to be minimal by construction. After assigning as many workers as we can, we then deduct the lowest uncovered cost to create a new qualified worker. Thus, every iteration we are guaranteed to make positive progress towards our goal of finding an optimal assignment. This scheme results in a worst case $O(n^3)$ time to complete.
Algorithm 2 The Hungarian method for the LAP.

 Star the first uncovered zero in row , cover the corresponding column for
 All columns not covered
 Uncovered zeros
 Prime the current uncovered zero
 There’s a starred zero in this row
 Uncover the starred zero’s column and cover the row

 Find an alternating augmented path from the primed zero
 Unstar the starred zeros on the path and star the primed zeros on the path
 Remove all the prime markings and cover all starred zeros
 Found path

 over all uncovered
 for all uncovered columns
 for all covered rows
 Uncovered zeros
 Starred zeros // These are all the assignments
To further illustrate the algorithm, consider the following example where starred entries are denoted by red, and primed entries by green:
Analysis
The prevailing convention in the literature is to look at the approximation factor to determine how close the results of an approximation algorithm are to optimal [10]. Here this ratio is the expected minimum cost assignment of the algorithm under test over the expected minimum assignment cost of the exact algorithm. Let $C$ be an $n \times n$ standard exponential random cost matrix, i.e., each $c_{ij} \sim \mathrm{Exp}(1)$ independently. We resort to the exponential distribution for its ease of analysis and prominence in related literature; cf. the works of [7], [8] for analysis based on uniformly distributed costs.
Exponential Distribution Properties Let $X \sim \mathrm{Exp}(1)$ have cumulative distribution function $F(x) = 1 - e^{-x}$ and expectation $\mathrm{E}[X] = 1$. The distribution demonstrates the memoryless property for expectations, $\mathrm{E}[X \mid X > a] = a + \mathrm{E}[X]$. Define the order statistic $X_{(1)}$ to be the minimum of $n$ draws from $\mathrm{Exp}(1)$; then $X_{(1)} \sim \mathrm{Exp}(n)$ [2] with expectation $\mathrm{E}[X_{(1)}] = 1/n$. More generally, if $X \sim \mathrm{Exp}(\lambda)$, then the minimum of $n$ draws is $\mathrm{Exp}(n \lambda)$ with expectation $1 / (n \lambda)$.
Expected Minimum Cost The expected minimum assignment cost for $C$ is given by [1]:

$\mathrm{E}\left[ \min_{\sigma} \sum_{i} c_{i \sigma(i)} \right] = \sum_{k=1}^{n} \frac{1}{k^2}$

which is the generalized harmonic number of order two and converges to $\pi^2 / 6$. For the generalized harmonic numbers, $H_{n,r} = \sum_{k=1}^{n} k^{-r}$, the limit $\lim_{n \to \infty} H_{n,r}$ exists for $r > 1$.
Greedy The minimum value of an $m \times m$ matrix of standard exponentials is the order statistic over $m^2$ draws, with expectation $1/m^2$. The expected value of the minimum cost assignment is not just $\sum_{k=1}^{n} 1/k^2$ because that doesn’t take into account the previous iteration’s minimum value. To accomplish this we make use of the memoryless property of the exponential distribution to observe that the expected difference in minimums between iteration $j-1$ and iteration $j$ is the expected minimum of the remaining $(n-j+1)^2$ entries, $1/(n-j+1)^2$. If we add up these differences we get the expected minimum value at the k’th iteration; summing those expectations over all iterations then yields the expected minimum cost assignment:

$\mathrm{E}[\text{greedy}] = \sum_{k=1}^{n} \sum_{j=1}^{k} \frac{1}{(n-j+1)^2} = \sum_{j=1}^{n} \frac{1}{n-j+1} = \sum_{k=1}^{n} \frac{1}{k} = H_n$

This is the harmonic number of order one, which does not converge. The resulting approximation factor is:

$\frac{H_n}{\sum_{k=1}^{n} 1/k^2} \to \frac{6}{\pi^2} H_n = \Theta(\log n)$
Random The random algorithm will simply select an assignment permutation, so we are just adding up $n$ $\mathrm{Exp}(1)$ distributed random variables, leading to an expected cost of:

$\mathrm{E}[\text{random}] = n$

And approximation factor:

$\frac{n}{\sum_{k=1}^{n} 1/k^2} \to \frac{6 n}{\pi^2} = \Theta(n)$
From this analysis one concludes that the greedy algorithm has an unbounded approximation factor that grows significantly slower than that of randomly selecting assignments.
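The $H_n$ claim above is easy to sanity-check by Monte Carlo simulation (this is my own check, not part of the article’s evaluation; `greedy_cost` is the simple scan variant of the greedy heuristic):

```python
import math, random

def greedy_cost(cost):
    """Total cost of the greedy heuristic (simple scan variant)."""
    n = len(cost)
    rows, cols = set(range(n)), set(range(n))
    total = 0.0
    while rows:
        r, c = min(((i, j) for i in rows for j in cols),
                   key=lambda rc: cost[rc[0]][rc[1]])
        total += cost[r][c]
        rows.discard(r)
        cols.discard(c)
    return total

def trial(n):
    """One greedy run on a fresh n x n standard exponential cost matrix."""
    return greedy_cost([[random.expovariate(1.0) for _ in range(n)]
                        for _ in range(n)])

random.seed(42)
n, trials = 8, 400
avg = sum(trial(n) for _ in range(trials)) / trials
h_n = sum(1.0 / k for k in range(1, n + 1))  # H_8, about 2.718
```

The simulated average lands close to $H_n$, in line with the memorylessness argument.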
Evaluation
To illustrate the preceding results, Figure 1 shows the approximation factor for the greedy algorithm implementations against the derived approximation factor. The simulated results are based on 120 standard exponentially distributed matrices for each problem size. Using the same conventions for the approximation factor, Figure 2 illustrates the runtime characteristics of the algorithms after rejecting outliers due to system fluctuations. Results were obtained from source code compiled with -O3 flags and run on a Xeon E3-1245 v5 3.5 GHz system with 32 GB of 2133 MHz DDR4 RAM. The measured runtimes coincide with the theoretical time complexities as shown in Table 2.
Solver            MSE
GREEDY-EFFICIENT  0.002139
GREEDY-NAIVE      0.014161
HUNGARIAN         0.232998
Summary
                Brute    Random  Greedy        Hungarian
Complexity      O(n!)    O(n)    O(n² log n)   O(n³)
Approx. factor  1        Θ(n)    Θ(log n)      1
Exact solutions can be delivered by the brute method when a handful of workers are being considered, and the Hungarian method should be considered for all other instances. Approximate solutions can be provided by the greedy algorithm with logarithmic degeneracy while providing a linear factor improvement over the Hungarian method. For inputs greater than those considered, the parallel Auction algorithm [3] is a suitable alternative and the subject of future work.
References
 Aldous, D. J. The ζ(2) limit in the random assignment problem. Random Structures & Algorithms 18, 4 (2001), 381–418.
 Balakrishnan, N., and Rao, C. Handbook of statistics 16: Order statistics: Theory and methods, 2000.
 Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of Operations Research 14, 1 (1988), 105–123.
 Durstenfeld, R. Algorithm 235: Random permutation. Communications of the ACM 7, 7 (1964), 420.
 Heap, B. Permutations by interchanges. The Computer Journal 6, 3 (1963), 293–298.
 Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 1–2 (1955), 83–97.
 Kurtzberg, J. M. On approximation methods for the assignment problem. Journal of the ACM (JACM) 9, 4 (1962), 419–439.
 Steele, J. M. Probability and statistics in the service of computer science: Illustrations using the assignment problem. Communications in Statistics - Theory and Methods 19, 11 (1990), 4315–4329.
 Munkres, J. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5, 1 (1957), 32–38.
 Williamson, D. P., and Shmoys, D. B. The design of approximation algorithms. Cambridge University Press, 2011.
kMeans Clustering using CUDAfy.NET
Introduction
I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised k-means clustering algorithm using CUDAfy.NET, a C# wrapper for doing parallel computation on CUDA-enabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikit-learn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.
Algorithms
Sequential
The traditional algorithm attributed to [Stu82] begins as follows:
 Pick k points at random as the starting centroid of each cluster.
 do (until convergence)
  For each point in data set:
   labels[point] = Assign(point, centroids)
  centroids = Aggregate(points, labels)
  convergence = DetermineConvergence()
 return centroids
Assign labels each point with the label of the nearest centroid, and Aggregate updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign routine. For each of the $N$ points we’ll compute the distance to each of the $K$ centroids and pick the centroid with the shortest distance to assign to the point. This is an example of the Nearest Neighbor Search problem. Linear search gives $O(K)$ per point, which is preferable to using something like k-d trees, which require repeated superlinear construction and querying. Assuming Euclidean distance and points from $\mathbb{R}^d$, this gives time complexity $O(N K d)$. The Aggregate routine will take $O(N d + K d)$. Assuming convergence is guaranteed in $c$ iterations, the resulting complexity is $O(c N K d)$, which lends to an effectively linear algorithm.
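The sequential loop just described can be sketched compactly in Python (the article’s implementations are C++/C#/Python; the function name and fixed iteration count here are my own simplifications, with convergence checking elided):

```python
def kmeans(points, centroids, iters=20):
    """Lloyd-style iteration: Assign each point to the nearest centroid, then
    Aggregate each centroid to the mean of its assigned points."""
    k, d = len(centroids), len(points[0])
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iters):
        # Assign: nearest centroid by squared Euclidean distance
        labels = [min(range(k), key=lambda j: sum(
            (p[t] - centroids[j][t]) ** 2 for t in range(d))) for p in points]
        # Aggregate: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep an empty cluster's centroid where it was
                centroids[j] = [sum(p[t] for p in members) / len(members)
                                for t in range(d)]
    return centroids, labels
```

A real implementation would stop when the labels (or centroid movement) stabilize rather than running a fixed number of iterations.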
Parallel
[LiFa89] was among the first to study several different shared memory parallel algorithms for k-means clustering, and here I will be going with the following one:
 Pick k points at random as the starting centroid of each cluster.
 Partition the points into P equally sized sets
 Run to completion threadId from 1 to P as:
  do (until convergence)
   sum, count = zero(), zero()
   For each point in partition[threadId]:
    label = Assign(point, centroids)
    For each dim in point:
     sum[d * label + dim] += point[dim]
    count[label] = count[label] + 1
   if(barrier.Synchronize())
    centroids = sum / count
    convergence = DetermineConvergence()
 return centroids
The parallel algorithm can be viewed as smaller instances of the sequential algorithm processing chunks of points in parallel. There are two main departures from the sequential approach: 1) future centroid positions are accumulated and counted after each labeling, and 2) each iteration of the while loop is synchronized before continuing on to the next iteration using a barrier, a synchronization primitive that makes every thread wait until the last thread arrives, at which point all threads are released to continue.
In terms of time complexity, Assign remains unchanged at $O(K d)$ per point, and incrementing the sums and counts for the point’s label takes $O(d)$ time. Thus for the $N/P$ points handled by each thread, a single iteration of the loop takes $O(N K d / P)$ time. Given $P$ threads, the maximum time is determined by the last thread to enter the barrier, and assuming at most $c$ iterations, the overall complexity is $O(c N K d / P)$. This suggests we should see at most a $P$-fold speedup over the sequential implementation for large values of $N$.
GPGPU
The earliest work I found on doing kmeans clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:
 Pick k points at random as the starting centroid of each cluster.
 Partition the points into B blocks such that each block contains no more than T points
 do (until convergence)
  Initialize sums, counts to zero
  Process blockId 1 to B, several at a time in parallel on the GPGPU:
   If threadId == 0
    Initialize blockSum, blockCounts to zero
   Synchronize Threads
   label = Assign(points[blockId * T + threadId], centroids)
   For each dim in points[blockId * T + threadId]:
    atomic blockSum[label * pointDim + dim] += points[blockId * T + threadId][dim]
   atomic blockCount[label] += 1
   Synchronize Threads
   If threadId == 0
    atomic sums += blockSum
    atomic counts += blockCounts
  If threadId == 0
   centroids = sums / counts
   convergence = DetermineConvergence()
The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that each process a single “block” at a time. Here we assign no more than T points to a block such that each point is handled by a single thread executing on one of the CUDA cores of the Streaming Multiprocessor.
When a single block is executing, we’ll initialize the running sum and count as we did in the parallel case, then request that the running threads synchronize, then each thread calculates the label of its assigned point and atomically updates the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically copies the block level sum and counts over to the global sum and counts shared by all of the blocks.
Let’s figure out the time complexity. A single thread in a block takes $O(K d)$ time to label its point; assuming that all $T$ threads of a block execute in parallel, that there are $B$ blocks, and $S$ Streaming Multiprocessors, the per-iteration complexity becomes $O(\lceil B / S \rceil K d)$. Since $B = \lceil N / T \rceil$, and at most $c$ iterations can go by, we are left with $O(c N K d / (S T))$. So the expected speedup over the sequential algorithm should be on the order of $S T$.
Expected performance
For large values of $N$, depending on whether the dimension or the number of clusters dominates, we should expect the parallel version to be 8x faster than the sequential version and the GPGPU version to be roughly 255x to 340x faster than the sequential version, given the hardware that will be used to conduct the tests. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching issues that are not incorporated (and difficult to incorporate) into the calculations.
Implementations
I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikitlearn was configured for testing.
C++
The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high-level orchestration code. Let's begin with the barrier implementation. Since I'm running on Windows 7, I'm unable to use the convenient InitializeSynchronizationBarrier, EnterSynchronizationBarrier, and DeleteSynchronizationBarrier API calls available beginning with Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:
```cpp
// ----------------------------------------------------------------------------
// Synchronization utility functions
// ----------------------------------------------------------------------------
struct Barrier {
    CONDITION_VARIABLE conditionVariable;
    CRITICAL_SECTION criticalSection;
    int atBarrier;
    int expectedAtBarrier;
};

void deleteBarrier(Barrier* barrier) {
    DeleteCriticalSection(&(barrier->criticalSection));
    // No API for deleting a condition variable
}

void initializeBarrier(Barrier* barrier, int numThreads) {
    barrier->atBarrier = 0;
    barrier->expectedAtBarrier = numThreads;
    InitializeConditionVariable(&(barrier->conditionVariable));
    InitializeCriticalSection(&(barrier->criticalSection));
}

bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) {
    bool lastToEnter = false;
    EnterCriticalSection(&(barrier->criticalSection));

    ++(barrier->atBarrier);

    if (barrier->atBarrier == barrier->expectedAtBarrier) {
        barrier->atBarrier = 0;
        lastToEnter = true;
        func(data);
        WakeAllConditionVariable(&(barrier->conditionVariable));
    } else {
        SleepConditionVariableCS(&(barrier->conditionVariable), &(barrier->criticalSection), INFINITE);
    }

    LeaveCriticalSection(&(barrier->criticalSection));

    return lastToEnter;
}
```
A Barrier struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.
When a thread arrives at the barrier (synchronizeBarrier) it requests the critical section before attempting to increment the atBarrier variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post-barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is signaled. The reason why LeaveCriticalSection is included outside the if statement is that SleepConditionVariableCS will release the critical section before putting the thread to sleep, then reacquire the critical section when it awakes. I don't like that behavior since it's an unnecessary acquisition of the critical section and slows down the implementation.
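For readers not on Windows, the same arrive-run-callback-wake pattern can be sketched with portable C++11 primitives. This is my own illustrative translation, not the post's code; `PortableBarrier` and its members are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A reusable barrier: the last thread to arrive runs a callback while
// holding the lock, then wakes everyone. A generation counter guards
// against spurious wakeups and lets the barrier be reused across
// iterations.
class PortableBarrier {
public:
    explicit PortableBarrier(int expected) : expected_(expected) {}

    // Returns true for the thread that was last to arrive.
    bool arriveAndWait(const std::function<void()>& onLast) {
        std::unique_lock<std::mutex> lock(mutex_);
        std::size_t gen = generation_;
        if (++arrived_ == expected_) {
            arrived_ = 0;
            ++generation_;
            if (onLast) onLast();
            cv_.notify_all();
            return true;
        }
        cv_.wait(lock, [&] { return generation_ != gen; });
        return false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int arrived_ = 0;
    int expected_;
    std::size_t generation_ = 0;
};

// Demo: four threads each add their id to a shared sum before arriving;
// the last arrival snapshots the fully accumulated sum.
int barrierDemo() {
    PortableBarrier barrier(4);
    std::mutex m;
    int sum = 0, snapshot = 0;
    std::vector<std::thread> threads;
    for (int i = 1; i <= 4; ++i)
        threads.emplace_back([&, i] {
            { std::lock_guard<std::mutex> g(m); sum += i; }
            barrier.arriveAndWait([&] { snapshot = sum; });
        });
    for (auto& t : threads) t.join();
    return snapshot;  // all four contributions observed
}
```

Unlike the version above, `std::condition_variable::wait` takes a predicate, which handles the spurious-wakeup case that the Windows version leaves implicit.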
There is a single allocation routine which performs a couple of different rounds of error checking when calling calloc: first to check if the routine returned null, and second to see if it set a Windows error code that I could inspect from GetLastError. If either event is true, the application will terminate.
```cpp
// ----------------------------------------------------------------------------
// Allocation utility functions
// ----------------------------------------------------------------------------
void* checkedCalloc(size_t count, size_t size) {
    SetLastError(NO_ERROR);

    void* result = calloc(count, size);
    DWORD lastError = GetLastError();

    if (result == NULL) {
        fprintf(stdout, "Failed to allocate %zu bytes. GetLastError() = %d.", size, lastError);
        ExitProcess(EXIT_FAILURE);
    }

    if (result != NULL && lastError != NO_ERROR) {
        fprintf(stdout, "Allocated %zu bytes. GetLastError() = %d.", size, lastError);
        ExitProcess(EXIT_FAILURE);
    }

    return result;
}
```
Now on to the core of the implementation. A series of structs are specified for those data that are shared (e.g., points, centroids, etc.) among the threads, and those that are local to each thread (e.g., point boundaries, partial results).
```cpp
// ----------------------------------------------------------------------------
// Parallel Implementation
// ----------------------------------------------------------------------------
struct LocalAssignData;

struct SharedAssignData {
    Barrier barrier;
    bool continueLoop;

    int numPoints;
    int pointDim;
    int K;

    double* points;
    double* centroids;
    int* labels;

    int maxIter;
    double change;
    double pChange;

    DWORD numProcessors;
    DWORD numThreads;

    LocalAssignData* local;
};

struct LocalAssignData {
    SharedAssignData* shared;
    int begin;
    int end;

    int* labelCount;
    double* partialCentroids;
};
```
The assign method does exactly what was specified in the parallel algorithm section. It will iterate over the portion of points it is responsible for, computing their labels and its partial centroids (sums of points with each label; division is done at the aggregate step).
```cpp
void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) {
    int* local = (int*)checkedCalloc(end - begin, sizeof(int));
    int* localCount = (int*)checkedCalloc(K, sizeof(int));
    double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double));

    // Process a chunk of the array.
    for (int point = begin; point < end; ++point) {
        double optDist = INFINITY;
        int optCentroid = -1;

        for (int centroid = 0; centroid < K; ++centroid) {
            double dist = 0.0;
            for (int dim = 0; dim < pointDim; ++dim) {
                double d = points[point * pointDim + dim] - centroids[centroid * pointDim + dim];
                dist += d * d;
            }

            if (dist < optDist) {
                optDist = dist;
                optCentroid = centroid;
            }
        }

        local[point - begin] = optCentroid;
        ++localCount[optCentroid];

        for (int dim = 0; dim < pointDim; ++dim)
            localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim];
    }

    memcpy(&label[begin], local, sizeof(int) * (end - begin));
    free(local);

    memcpy(labelCount, localCount, sizeof(int) * K);
    free(localCount);

    memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K);
    free(localPartial);
}
```
One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using it within the assign routine. This in particular was motivated after I read about false sharing, where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU, causing overall performance to degrade. For labelCount and partialCentroids, they're reallocated since I was concerned about data locality and wanted the three arrays to be relatively in the same neighborhood of memory. Speaking of which, memory coalescing is used for the points array so that point dimensions are adjacent in memory to take advantage of caching. Overall, a series of cache-friendly optimizations.
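To make the false-sharing concern concrete, here is a small portable sketch, not from the post, where each thread's accumulator is padded out to its own cache line so that concurrent writes never land on the same line, and aggregation happens once at the end. The 64-byte line size is a common x86 assumption (C++17's `std::hardware_destructive_interference_size` is the portable constant), and all names here are my own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread's counter occupies its own 64-byte cache line (requires
// C++17 for over-aligned allocation in std::vector).
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Count even values in parallel; each thread writes only to its private,
// padded counter, so no two threads contend for the same cache line.
long countEvens(const std::vector<int>& data, int numThreads) {
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                if (data[i] % 2 == 0) ++counters[t].value;  // private line
        });
    for (auto& th : threads) th.join();

    long total = 0;
    for (auto& c : counters) total += c.value;  // aggregate once at the end
    return total;
}
```

The same idea is at work in the post's code: each thread accumulates into its own `labelCount` and `partialCentroids` buffers, and only the aggregate step reads across threads.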
The aggregate routine follows a similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args->shared->local[t].partialCentroids and args->shared->local[t].labelCount[k]. Using these partial results allows the routine to complete in time which, assuming all of these parameters are significantly less than the number of points, gives an effectively constant-time routine. Once the centroids have been updated, the change in their location is computed and used, along with how many iterations have gone by, to determine convergence. Here, if more than 1,000 iterations have occurred, or the relative change in position is less than some tolerance (0.1%), then the threads will terminate.
```cpp
void aggregate(void* data) {
    LocalAssignData* args = (LocalAssignData*)data;

    int* assignmentCounts = (int*)checkedCalloc(args->shared->K, sizeof(int));
    double* newCentroids = (double*)checkedCalloc(args->shared->K * args->shared->pointDim, sizeof(double));

    // Compute the assignment counts from the work the threads did.
    for (int t = 0; t < args->shared->numThreads; ++t)
        for (int k = 0; k < args->shared->K; ++k)
            assignmentCounts[k] += args->shared->local[t].labelCount[k];

    // Compute the location of the new centroids based on the work that the
    // threads did.
    for (int t = 0; t < args->shared->numThreads; ++t)
        for (int k = 0; k < args->shared->K; ++k)
            for (int dim = 0; dim < args->shared->pointDim; ++dim)
                newCentroids[k * args->shared->pointDim + dim] += args->shared->local[t].partialCentroids[k * args->shared->pointDim + dim];

    for (int k = 0; k < args->shared->K; ++k)
        for (int dim = 0; dim < args->shared->pointDim; ++dim)
            newCentroids[k * args->shared->pointDim + dim] /= assignmentCounts[k];

    // See by how much the position of the centroids changed.
    args->shared->change = 0.0;
    for (int k = 0; k < args->shared->K; ++k)
        for (int dim = 0; dim < args->shared->pointDim; ++dim) {
            double d = args->shared->centroids[k * args->shared->pointDim + dim] - newCentroids[k * args->shared->pointDim + dim];
            args->shared->change += d * d;
        }

    // Store the new centroid locations into the centroid output.
    memcpy(args->shared->centroids, newCentroids, sizeof(double) * args->shared->pointDim * args->shared->K);

    // Decide if the loop should continue or terminate. (Max iterations
    // exceeded, or relative change not exceeded.)
    args->shared->continueLoop = args->shared->change > 0.001 * args->shared->pChange && --(args->shared->maxIter) > 0;

    args->shared->pChange = args->shared->change;

    free(assignmentCounts);
    free(newCentroids);
}
```
Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.
```cpp
DWORD WINAPI assignThread(LPVOID data) {
    LocalAssignData* args = (LocalAssignData*)data;

    while (args->shared->continueLoop) {
        memset(args->labelCount, 0, sizeof(int) * args->shared->K);

        // Assign points cluster labels
        assign(args->shared->labels, args->begin, args->end, args->labelCount, args->shared->K, args->shared->points, args->shared->pointDim, args->shared->centroids, args->partialCentroids);

        // Tell the last thread to enter here to aggregate the data within a
        // critical section
        synchronizeBarrier(&(args->shared->barrier), aggregate, args);
    };

    return 0;
}
```
The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance while the remaining instances will run on separate threads. The data is partitioned, then the threads are spawned using the CreateThread routine. I wish there were a Windows API that would allow me to simultaneously create threads with a specified array of arguments, because CreateThread will automatically start the thread as soon as it's created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.
```cpp
void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) {
    // Look up and calculate all the threading-related values.
    SYSTEM_INFO systemInfo;
    GetSystemInfo(&systemInfo);

    DWORD numProcessors = systemInfo.dwNumberOfProcessors;
    DWORD numThreads = numProcessors - 1;
    DWORD pointsPerProcessor = numPoints / numProcessors;

    // Prepare the shared arguments that will get passed to each thread.
    SharedAssignData shared;
    shared.numPoints = numPoints;
    shared.pointDim = pointDim;
    shared.K = K;
    shared.points = points;
    shared.continueLoop = true;
    shared.maxIter = 1000;
    shared.pChange = 0.0;
    shared.change = 0.0;
    shared.numThreads = numThreads;
    shared.numProcessors = numProcessors;

    initializeBarrier(&(shared.barrier), numProcessors);

    shared.centroids = centroids;
    for (int i = 0; i < K; ++i) {
        int point = rand() % numPoints;
        for (int dim = 0; dim < pointDim; ++dim)
            shared.centroids[i * pointDim + dim] = points[point * pointDim + dim];
    }

    shared.labels = (int*)checkedCalloc(numPoints, sizeof(int));

    // Create thread workload descriptors. The last chunk absorbs any
    // remainder points so none go unassigned.
    LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData));
    for (int i = 0; i < numProcessors; ++i) {
        local[i].shared = &shared;
        local[i].begin = i * pointsPerProcessor;
        local[i].end = (i + 1 == numProcessors) ? numPoints : (i + 1) * pointsPerProcessor;
        local[i].labelCount = (int*)checkedCalloc(K, sizeof(int));
        local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double));
    }

    shared.local = local;

    // Kick off the threads
    HANDLE* threads = (HANDLE*)checkedCalloc(numThreads, sizeof(HANDLE));
    for (int i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(0, 0, assignThread, &local[i + 1], 0, NULL);

    // Do work on this thread so that it's not just sitting here idle while
    // the other threads are doing work.
    assignThread(&local[0]);

    // Clean up
    WaitForMultipleObjects(numThreads, threads, true, INFINITE);
    for (int i = 0; i < numThreads; ++i)
        CloseHandle(threads[i]);
    free(threads);

    for (int i = 0; i < numProcessors; ++i) {
        free(local[i].labelCount);
        free(local[i].partialCentroids);
    }
    free(local);

    free(shared.labels);

    deleteBarrier(&(shared.barrier));
}
```
C#
The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.
In the GPGPU paradigm there is a host and a device: sequential operations take place on the host (i.e., managed C# code) and parallel operations on the device (i.e., CUDA code). To delineate between the two, the [Cudafy] method attribute is used on the static public method assign. The set of host operations are all within the Fit routine.
Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block, and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and was motivated by Thomas Bradley's Advanced CUDA Optimization workshop notes [pdf], which suggest that at that regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) that has shared memory the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.
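The partitioning just described can be sketched on the host side as follows. This is a minimal illustration with my own names; `maxThreadsPerBlock` would come from the device properties at run time.

```cpp
#include <cassert>

// Sketch of the block partitioning described above: use half the
// device's maximum threads per block, and enough blocks so that every
// point gets a thread. Out-of-range threads in the last block simply
// return, mirroring the bounds check in the kernel.
struct LaunchShape {
    int threadsPerBlock;
    int numBlocks;
};

LaunchShape partition(int numPoints, int maxThreadsPerBlock) {
    int threadsPerBlock = maxThreadsPerBlock / 2;
    int numBlocks = (numPoints + threadsPerBlock - 1) / threadsPerBlock;
    return { threadsPerBlock, numBlocks };
}
```

For example, 100,000 points on a device reporting 1,024 maximum threads per block yields 512 threads per block and 196 blocks.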
What is nice about the shared memory is that it is much faster than the global memory of the GPGPU (cf. Using Shared Memory in CUDA C/C++). To make use of this fact, the threads will rely on two arrays in shared memory: the sums of the points and the counts of those belonging to each centroid. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will then add the arrays in shared memory to those in global memory using the appropriate atomic operations.
```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
using Cudafy.Atomics;
using System;

namespace CUDAfyTesting {
    public class CUDAfyKMeans {
        [Cudafy]
        public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) {
            // Unpack the const value array
            int pointDim = constValues[0];
            int K = constValues[1];
            int numPoints = constValues[2];

            // Ensure that the point is within the boundaries of the points
            // array.
            int tId = thread.threadIdx.x;
            int point = thread.blockIdx.x * thread.blockDim.x + tId;
            if (point >= numPoints)
                return;

            // Use two shared arrays since they are much faster than global
            // memory. The shared arrays will be scoped to the block that this
            // thread belongs to.

            // Accumulate each point's dimension assigned to the k'th
            // centroid. When K = 128 => pointDim = 2; when pointDim = 128
            // => K = 2; thus max(len(sharedSums)) = 256.
            float[] sharedSums = thread.AllocateShared<float>("sums", 256);
            if (tId < K * pointDim)
                sharedSums[tId] = 0.0f;

            // Keep track of how many times the k'th centroid has been assigned
            // to a point. max(K) = 128
            int[] sharedCounts = thread.AllocateShared<int>("counts", 128);
            if (tId < K)
                sharedCounts[tId] = 0;

            // Make sure all threads share the same shared state before doing
            // any calculations.
            thread.SyncThreads();

            // Find the optCentroid for point.
            double optDist = double.PositiveInfinity;
            int optCentroid = -1;

            for (int centroid = 0; centroid < K; ++centroid) {
                double dist = 0.0;
                for (int dim = 0; dim < pointDim; ++dim) {
                    double d = centroids[centroid * pointDim + dim] - points[point * pointDim + dim];
                    dist += d * d;
                }

                if (dist < optDist) {
                    optDist = dist;
                    optCentroid = centroid;
                }
            }

            // Add the point to the optCentroid sum
            for (int dim = 0; dim < pointDim; ++dim)
                // CUDA doesn't support double precision atomicAdd so cast down
                // to float...
                thread.atomicAdd(ref(sharedSums[optCentroid * pointDim + dim]), (float)points[point * pointDim + dim]);

            // Increment the optCentroid count
            thread.atomicAdd(ref(sharedCounts[optCentroid]), +1);

            // Wait for all of the threads to complete populating the shared
            // memory before storing the results back to global memory where
            // the host can access the results.
            thread.SyncThreads();

            // Have to do a lock on both of these since some other Streaming
            // Multiprocessor could be running and attempting to update the
            // values at the same time.

            // Copy the shared sums to the output sums
            if (tId == 0)
                for (int i = 0; i < K * pointDim; ++i)
                    thread.atomicAdd(ref(outputSums[i]), sharedSums[i]);

            // Copy the shared counts to the output counts
            if (tId == 0)
                for (int i = 0; i < K; i++)
                    thread.atomicAdd(ref(outputCounts[i]), sharedCounts[i]);
        }
```
Before going on to the Fit method, let's look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDA-enabled GPGPU. Within the CUDAfy.Translator namespace there are a handful of classes for decompiling the application into an abstract syntax tree using ICSharpCode.Decompiler and Mono.Cecil, then converting the AST over to CUDA C via the visitor pattern, and next compiling the resulting CUDA C using NVIDIA's NVCC compiler. Finally, the compilation result is relayed back to the caller if there's a problem; otherwise, a CudafyModule instance is returned, and the compiled CUDA C code it represents is loaded up on the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy, CudaLanguage.RunTransformsAndGenerateCode, CUDAAstBuilder.GenerateCode, CUDAOutputVisitor, and CudafyModule.Compile.)
```csharp
        private CudafyModule cudafyModule;
        private GPGPU gpgpu;
        private GPGPUProperties properties;

        public int PointDim { get; private set; }
        public double[] Centroids { get; private set; }

        public CUDAfyKMeans() {
            cudafyModule = CudafyTranslator.Cudafy();
            gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            properties = gpgpu.GetDeviceProperties(true);
            gpgpu.LoadModule(cudafyModule);
        }
```
The Fit method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive, and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory: essentially, copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points array since it's the largest resource, but did not see noticeable gains, so I left it as is.
At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set method, then the CUDA code is invoked using the Launch routine with the specified block and grid dimensions and device pointers. One odd thing the API does is return an array when you allocate or copy memory over to the device; personally, an IntPtr seems more appropriate. Execution of the routine is very quick: on some of my tests it took 1 to 4 ms to process 100,000 two-dimensional points. Once the routine returns, memory from the device (sums and counts) is copied back over to the host, which then does a quick operation to derive the new centroid locations and copies that memory over to the device for the next iteration.
```csharp
        public void Fit(double[] points, int pointDim, int K) {
            if (K <= 0)
                throw new ArgumentOutOfRangeException("K", "Must be greater than zero.");

            if (pointDim <= 0)
                throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero.");

            if (points.Length < pointDim)
                throw new ArgumentOutOfRangeException("points", "Must have at least pointDim entries.");

            if (points.Length % pointDim != 0)
                throw new ArgumentException("points.Length must be n * pointDim > 0.");

            int numPoints = points.Length / pointDim;

            // Figure out the partitioning of the data.
            int threadsPerBlock = properties.MaxThreadsPerBlock / 2;
            int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0);

            dim3 blockSize = new dim3(threadsPerBlock, 1, 1);

            dim3 gridSize = new dim3(
                Math.Min(properties.MaxGridSize.x, numBlocks),
                Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)),
                1
            );

            int[] constValues = new int[] { pointDim, K, numPoints };
            float[] assignmentSums = new float[pointDim * K];
            int[] assignmentCount = new int[K];

            // Initial centroid locations picked at random
            Random prng = new Random();
            double[] centroids = new double[K * pointDim];
            for (int centroid = 0; centroid < K; centroid++) {
                int point = prng.Next(points.Length / pointDim);
                for (int dim = 0; dim < pointDim; dim++)
                    centroids[centroid * pointDim + dim] = points[point * pointDim + dim];
            }

            // These arrays are only read from on the GPU; they are never
            // written on the GPU.
            int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues);
            double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids);
            double[] devicePoints = gpgpu.CopyToDevice<double>(points);

            // These arrays are written to on the GPU.
            float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums);
            int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount);

            // Set up the main loop so that no more than maxIter iterations
            // take place, and so that a relative change less than 1% in
            // centroid positions will terminate the loop.
            int maxIter = 1000;
            double change = 0.0, pChange = 0.0;

            do {
                pChange = change;

                // Clear out the assignments, and assignment counts on the GPU.
                gpgpu.Set(deviceSums);
                gpgpu.Set(deviceCount);

                // Launch the GPU portion
                gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount);

                // Copy the results memory from the GPU over to the CPU.
                gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums);
                gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount);

                // Compute the new centroid locations.
                double[] newCentroids = new double[centroids.Length];
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim)
                        newCentroids[centroid * pointDim + dim] = assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid];

                // Calculate how much the centroids have changed to decide
                // whether or not to terminate the loop.
                change = 0.0;
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim) {
                        double d = newCentroids[centroid * pointDim + dim] - centroids[centroid * pointDim + dim];
                        change += d * d;
                    }

                // Update centroid locations on CPU & GPU
                Array.Copy(newCentroids, centroids, newCentroids.Length);
                deviceCentroids = gpgpu.CopyToDevice<double>(centroids);

            } while (change > 0.01 * pChange && --maxIter > 0);

            gpgpu.FreeAll();

            this.Centroids = centroids;
            this.PointDim = pointDim;
        }
    }
}
```
Python
I include the Python implementation for the sake of demonstrating how scikit-learn was invoked throughout the following experiments section.
```python
model = KMeans(
    n_clusters=numClusters,
    init='random',
    n_init=1,
    max_iter=1000,
    tol=1e-3,
    precompute_distances=False,
    verbose=0,
    copy_x=False,
    n_jobs=numThreads
)

model.fit(X)  # X = (numPoints, pointDim) numpy array.
```
Experimental Setup
All experiments were conducted on a laptop with an Intel Core i7-2630QM processor and an NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. The C++ and C# implementations were developed and compiled with Microsoft Visual Studio Express 2013 for Desktop, targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). The Python implementation was developed using Eclipse Luna 4.4.1, targeting Python 2.7, scikit-learn 0.16.0, and numpy 1.9.1. All compilers use default arguments and no extra optimization flags.
For each test, each reported test point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is computed as the (wall) time taken to execute model.fit(points, pointDim, numClusters), where time is measured by QueryPerformanceCounter in C++, System.Diagnostics.Stopwatch in C#, and time.clock in Python. Every test is based on a dataset having two natural clusters at -.25 or .25 in each dimension.
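A dataset of that shape can be generated along these lines. This is an illustrative sketch, not the post's generator; the 0.05 noise level and the function name are my assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Generate numPoints points of dimension pointDim in row-major layout:
// the first half centered at -0.25 in every dimension, the second half
// at +0.25, each perturbed by small Gaussian noise.
std::vector<double> makeTwoClusters(int numPoints, int pointDim, unsigned seed) {
    std::mt19937 prng(seed);
    std::normal_distribution<double> noise(0.0, 0.05);

    std::vector<double> points(numPoints * (std::size_t)pointDim);
    for (int p = 0; p < numPoints; ++p) {
        double center = (p < numPoints / 2) ? -0.25 : 0.25;
        for (int d = 0; d < pointDim; ++d)
            points[p * (std::size_t)pointDim + d] = center + noise(prng);
    }
    return points;
}
```

The flat row-major layout matches the `points[point * pointDim + dim]` indexing used by all three implementations above.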
Results
Varying point quantity
Both the C++ and C# sequential and parallel implementations outperform the Python scikit-learn implementations. However, the C++ sequential and parallel implementations outperform their C# counterparts. The C++ sequential and parallel implementations are effectively tied, as it seems the overhead associated with multithreading overrides any multithreaded performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but does outperform the C# sequential one as the number of points to cluster increases.
So what's the deal with Python scikit-learn? Why is the parallel version so slow? Well, it turns out I misunderstood the n_jobs parameter. I interpreted this to mean that the process of clustering a single set of points would be done in parallel; however, it actually controls how many simultaneous runs of the whole clustering process occur in parallel. I was tipped off to this when I noticed multiple python.exe fork processes being spun off; it surprised me that someone would implement a parallel routine that way, which led to a more thorough reading of the scikit-learn documentation. There is parallelism going on with scikit-learn, just not the desired type. Taking that into account, the sequential run performs reasonably well for a dynamically typed, interpreted language.
Varying point dimension
The C++ and C# parallel implementations exhibit consistently improved run time over their sequential counterparts. In all cases the performance is better than scikit-learn's. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why do we not see better CUDAfy.NET performance? The performance we see is identical to the varying point quantity test. So on one hand it's nice that increasing the point dimension did not dramatically increase the run time, but ideally the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device, which is a relatively slow process. Since I'm short on time, this will have to be something I investigate in more detail in the future.
Varying cluster quantity
As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikit-learn implementation starts to show some competitive performance. The exciting news of course is that varying the cluster quantity finally reveals improved C# CUDAfy.NET run time. Now, there is some curious behavior at the beginning of each plot. We get one level of performance for two clusters, then jump up for four to eight clusters. The number of points and their dimension are held constant, but we allocate a few extra doubles for the cluster centroids. I believe this has to do with cache behavior. I'm assuming that for fewer than four clusters everything that's needed sits nicely in the fast L1 cache, and that moving up to four and more clusters requires more exchanging of data between L1, L2, L3, and (slower) main memory among the different cores of the Intel Core i7-2630QM processor I'm using. As before, I'll need to do some more tests to verify that this is what is truly happening.
Language comparison
For the three tests considered, the C++ implementations gave the best run time performance on the point quantity and point dimension tests, while the C# CUDAfy.NET implementation gave the best performance on the cluster quantity test.
The C++ implementation could be made to run faster by preallocating memory in the same fashion that C# does. In C#, when an application is first created, a block of memory is allocated for the managed heap. As a result, allocation of reference types in C# is done by incrementing a pointer instead of doing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before executing the C# routines, while the same allocation takes place during the C++ routines. Hence, the C++ run times carry an overhead not present in the C# run times. Had I implemented memory allocation in C++ the same way it's done in C#, then the C++ implementation would undoubtedly be even faster than the C# ones.
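The C#-style strategy described above can be approximated in C++ with a simple bump-pointer arena: grab one large block up front, then satisfy each allocation by advancing an offset, so the hot loop never calls malloc. This is a hypothetical sketch with simplified alignment handling, not code from the post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A minimal bump allocator: one upfront allocation, then each request
// just advances an offset (rounded up to max_align_t alignment).
// reset() "frees" everything at once, like collecting a whole heap.
class BumpArena {
public:
    explicit BumpArena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

    void* allocate(std::size_t bytes) {
        std::size_t aligned = (bytes + alignof(std::max_align_t) - 1)
                              & ~(alignof(std::max_align_t) - 1);
        if (offset_ + aligned > buffer_.size()) return nullptr;  // arena full
        void* result = buffer_.data() + offset_;
        offset_ += aligned;
        return result;
    }

    void reset() { offset_ = 0; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

In the k-means loop, the per-iteration scratch buffers (labels, partial sums, counts) could come out of such an arena and be reset between iterations, moving all real allocation cost to startup, which is roughly where the CLR pays it.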
While using scikit-learn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance; it frequently came in ten times slower than the other two implementations on the varying point quantity and dimension tests, but was within tolerance on the varying cluster quantity tests.
Future Work
The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.
I'd like to spend more time figuring out some of the high-performance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than the week or two I spent on this. In addition, I'd like to dig a little deeper into writing CUDA C directly rather than going through the convenient CUDAfy.NET wrapper, as well as explore OpenMP and OpenCL to see how they compare from a development- and performance-oriented view to CUDA.
Python and scikit-learn were used as a baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter, since Julia pitches itself as a high-performance solution and is used for exploratory data analysis and prototyping machine learning systems.
While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I'd like to apply CUDAfy.NET to the expectation-maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of k-means clustering, and it will be good to implement the more involved EM algorithm.
Conclusions
Through this exercise, we can expect to see modest speedups over sequential implementations of about 2.62x and 11.69x in the C# parallel and GPGPU implementations, respectively, when attempting to find large numbers of clusters on low-dimensional data. Fortunately, the way you use k-means clustering in practice is to find the cluster quantity that optimizes the Bayesian information criterion or Akaike information criterion, which means running the varying centroid quantity test on real data. On the other hand, most machine learning data is of high dimension, so further testing (on a real data set) would be needed to verify its effectiveness in a production environment. Nonetheless, we've seen how parallel and GPGPU-based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.
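As a sketch of that model-selection step, one common BIC-style score for a k-means fit trades the within-cluster sum of squared errors against model size, with lower being better. This particular form is one of several in the literature and is my own choice for illustration, not the post's.

```cpp
#include <cassert>
#include <cmath>

// BIC-style score for a k-means fit: n * ln(SSE / n) + K * d * ln(n).
// sse is the within-cluster sum of squared errors; lower score is
// better. Sweeping K and taking the minimizer selects the cluster count.
double kMeansBic(double sse, int numPoints, int K, int pointDim) {
    return numPoints * std::log(sse / numPoints)
         + K * pointDim * std::log((double)numPoints);
}
```

Holding the fit error fixed, a larger K is penalized; holding K fixed, a smaller error is rewarded, which is exactly the trade-off the varying cluster quantity test exercises.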
Abstract Algebra in C#
Motivation
In C++ it is easy to define arbitrary template methods for computations involving primitive numeric types because all such types inherently have arithmetic operations defined. Thus, a programmer need only implement one method for all numeric types. The compiler will infer the use, substitute the type at compile time, and emit the appropriate machine instructions. This is C++’s approach to parametric polymorphism.
With the release of C# 2.0 in the fall of 2005, the language finally got a taste of parametric polymorphism in the form of generics. Unfortunately, types in C# do not inherently have arithmetic operations defined, so methods involving computations must use ad-hoc polymorphism to achieve the same result as in C++. The consequence is greater code bloat and an increased maintenance liability.
To get around this design limitation, I’ve decided to leverage C#’s approach to subtype polymorphism and to draw from Abstract Algebra to implement a collection of interfaces allowing for C++ like template functionality in C#. The following is an overview of the mathematical theory used to support and guide the design of my solution. In addition, I will present example problems from mathematics and computer science that can be represented in this solution, along with examples of type-agnostic computations that can be performed using it.
Abstract Algebra
Abstract Algebra is focused on how different algebraic structures behave in the presence of different axioms, operations and sets. In the following three sections, I will go over the fundamental subfields and how they are represented under the solution.
In all three sections, I will represent the distinction between algebraic structures using C# interfaces. The type parameters on these interfaces represent the sets being acted upon by each algebraic structure. This convention is consistent with the intuitionistic (i.e., Church-style) type theory embraced by C#’s Common Type System (CTS). Parameter constraints are used when type parameters are intended to be of a specific type. Functions on the set and elements of the set are represented by methods and properties respectively.
Group Theory
Group Theory is the simplest of the subfields of Abstract Algebra, dealing with the study of a single binary operation, ∘, acting on a set S. There are five axioms used to describe the structures studied under Group Theory:

1. Closure: ∀ a, b ∈ S, a ∘ b ∈ S
2. Associativity: (a ∘ b) ∘ c = a ∘ (b ∘ c)
3. Commutativity: a ∘ b = b ∘ a
4. Identity: ∃ e ∈ S such that a ∘ e = e ∘ a = a
5. Inverse: ∀ a ∈ S, ∃ a⁻¹ ∈ S such that a ∘ a⁻¹ = a⁻¹ ∘ a = e

The simplest of these structures is the Groupoid, satisfying only axiom (1). Any Groupoid also satisfying axiom (2) is known as a Semigroup. Any Semigroup satisfying axiom (4) is a Monoid. Monoids also satisfying axiom (5) are known as Groups. Any Group satisfying axiom (3) is an Abelian Group.
public interface IGroupoid<T> {
    T Operation(T a, T b);
}

public interface ISemigroup<T> : IGroupoid<T> { }

public interface IMonoid<T> : ISemigroup<T> {
    T Identity { get; }
}

public interface IGroup<T> : IMonoid<T> {
    T Inverse(T t);
}

public interface IAbelianGroup<T> : IGroup<T> { }
Ring Theory
The next logical subfield of Abstract Algebra to study is Ring Theory, which is the study of two operations, + and ×, on a single set. In addition to the axioms outlined above, there is an additional axiom describing how one operation distributes over the other.

6. Distributivity: a × (b + c) = (a × b) + (a × c) and (a + b) × c = (a × c) + (b × c)

All of the following ring structures satisfy axiom (6). Rings are distinguished by the properties of their operands. The simplest of these structures is the Ringoid, where both operands are given by Groupoids. Any Ringoid whose operands are Semigroups is a Semiring. Any Semiring whose first operand is a Group is a Ring. Any Ring whose second operand is a Monoid is a Ring with Unity. Any Ring with Unity whose second operand is a Group is a Division Ring. Any Division Ring whose operands are both Abelian Groups is a Field.
public interface IRingoid<T, A, M>
    where A : IGroupoid<T>
    where M : IGroupoid<T> {
    A Addition { get; }
    M Multiplication { get; }
    T Distribute(T a, T b);
}

public interface ISemiring<T, A, M> : IRingoid<T, A, M>
    where A : ISemigroup<T>
    where M : ISemigroup<T> { }

public interface IRing<T, A, M> : ISemiring<T, A, M>
    where A : IGroup<T>
    where M : ISemigroup<T> { }

public interface IRingWithUnity<T, A, M> : IRing<T, A, M>
    where A : IGroup<T>
    where M : IMonoid<T> { }

public interface IDivisionRing<T, A, M> : IRingWithUnity<T, A, M>
    where A : IGroup<T>
    where M : IGroup<T> { }

public interface IField<T, A, M> : IDivisionRing<T, A, M>
    where A : IAbelianGroup<T>
    where M : IAbelianGroup<T> { }
Module Theory
The last, and more familiar, subfield of Abstract Algebra is Module Theory, which deals with structures having a scalar multiplication, ∘, over two separate sets, R (the scalars) and M (the vectors), that satisfy the following axioms:

7. Distributivity of vector addition: r ∘ (x + y) = r ∘ x + r ∘ y
8. Distributivity of scalar addition: (r + s) ∘ x = r ∘ x + s ∘ x
9. Associativity of scalar multiplication: (r × s) ∘ x = r ∘ (s ∘ x)

All of the following module structures satisfy axioms (7)-(9). A Module consists of a scalar Ring and a vector Abelian Group. Any Module whose Ring is a Ring with Unity is a Unitary Module. Any Unitary Module whose Ring with Unity is a Field is a Vector Space.
public interface IModule<TScalar, TVector, TScalarRing, TScalarAddativeGroup, TScalarMultiplicativeSemigroup, TVectorAddativeAbelianGroup>
    where TScalarRing : IRing<TScalar, TScalarAddativeGroup, TScalarMultiplicativeSemigroup>
    where TScalarAddativeGroup : IGroup<TScalar>
    where TScalarMultiplicativeSemigroup : ISemigroup<TScalar>
    where TVectorAddativeAbelianGroup : IAbelianGroup<TVector> {
    TScalarRing Scalar { get; }
    TVectorAddativeAbelianGroup Vector { get; }
    TVector Distribute(TScalar t, TVector r);
}

public interface IUnitaryModule<TScalar, TVector, TScalarRingWithUnity, TScalarAddativeGroup, TScalarMultiplicativeMonoid, TVectorAddativeAbelianGroup>
    : IModule<TScalar, TVector, TScalarRingWithUnity, TScalarAddativeGroup, TScalarMultiplicativeMonoid, TVectorAddativeAbelianGroup>
    where TScalarRingWithUnity : IRingWithUnity<TScalar, TScalarAddativeGroup, TScalarMultiplicativeMonoid>
    where TScalarAddativeGroup : IGroup<TScalar>
    where TScalarMultiplicativeMonoid : IMonoid<TScalar>
    where TVectorAddativeAbelianGroup : IAbelianGroup<TVector> { }

public interface IVectorSpace<TScalar, TVector, TScalarField, TScalarAddativeAbelianGroup, TScalarMultiplicativeAbelianGroup, TVectorAddativeAbelianGroup>
    : IUnitaryModule<TScalar, TVector, TScalarField, TScalarAddativeAbelianGroup, TScalarMultiplicativeAbelianGroup, TVectorAddativeAbelianGroup>
    where TScalarField : IField<TScalar, TScalarAddativeAbelianGroup, TScalarMultiplicativeAbelianGroup>
    where TScalarAddativeAbelianGroup : IAbelianGroup<TScalar>
    where TScalarMultiplicativeAbelianGroup : IAbelianGroup<TScalar>
    where TVectorAddativeAbelianGroup : IAbelianGroup<TVector> { }
Representation of Value Types
The CTS allows for both value and reference types on the .NET Common Language Infrastructure (CLI). The following are examples of how each theory presented above can leverage value types found in the C# language to represent concepts drawn from mathematics.
Enum Value Types and the Dihedral Group
One of the simplest finite groups is the Dihedral Group of order eight, D₄, representing the different orientations of a square obtained by reflecting the square about its vertical axis and by rotating it ninety degrees. These two generators give rise to a set of eight symmetries: rotations by 0, 90, 180, and 270 degrees, and reflections about the vertical, descending-diagonal, horizontal, and ascending-diagonal axes. The relationship between these elements is visualized below:
The easiest way to represent this group as a value type is with an enum.
enum Symmetry {
    Rot000, Rot090, Rot180, Rot270,
    RefVer, RefDes, RefHoz, RefAsc
}
From this enum we can define the basic Group Theory algebraic structures to take us to the group D₄.
public class SymmetryGroupoid : IGroupoid<Symmetry> {
    public Symmetry Operation(Symmetry a, Symmetry b) {
        // 64 cases
    }
}

public class SymmetrySemigroup : SymmetryGroupoid, ISemigroup<Symmetry> { }

public class SymmetryMonoid : SymmetrySemigroup, IMonoid<Symmetry> {
    public Symmetry Identity {
        get { return Symmetry.Rot000; }
    }
}

public class SymmetryGroup : SymmetryMonoid, IGroup<Symmetry> {
    public Symmetry Inverse(Symmetry a) {
        switch (a) {
            case Symmetry.Rot000: return Symmetry.Rot000;
            case Symmetry.Rot090: return Symmetry.Rot270;
            case Symmetry.Rot180: return Symmetry.Rot180; // a half turn is its own inverse
            case Symmetry.Rot270: return Symmetry.Rot090;
            // every reflection is its own inverse
            case Symmetry.RefVer: return Symmetry.RefVer;
            case Symmetry.RefDes: return Symmetry.RefDes;
            case Symmetry.RefHoz: return Symmetry.RefHoz;
            case Symmetry.RefAsc: return Symmetry.RefAsc;
        }
        throw new NotImplementedException();
    }
}
Integral Value Types and the Commutative Ring with Unity over ℤ/2ⁿℤ
C# exposes a number of fixed-bit integral value types that allow a programmer to pick an integral value type suitable for the scenario at hand. Operations over these integral value types form a commutative ring with unity whose set is the congruence classes ℤ/2ⁿℤ, where n is the number of bits used to represent the integer and the equivalence class [a] = { a + 2ⁿk : k ∈ ℤ }.
Addition is given by [a] + [b] = [a + b], just as multiplication is given by [a][b] = [ab]. Both statements are equivalent to the following congruence statements: a + b ≡ c (mod 2ⁿ) and ab ≡ c (mod 2ⁿ) respectively.
Under the binary numeral system, reduction modulo 2ⁿ is equivalent to discarding the bits at position n and above. As a result, only the first (rightmost) n bits need to be considered when computing the sum or product of two congruence classes, or in this case, integer values in C#. Thus, in the following implementation, it is not necessary to write any extra code to represent these operations other than writing them in their native form.
The reason why we are limited to a commutative ring with unity instead of a full field is that multiplicative inverses do not exist for all elements. A multiplicative inverse x of a exists only when ax ≡ 1 (mod 2ⁿ), and for a solution to exist, gcd(a, 2ⁿ) must equal 1. Immediately, any even value of a will not have a multiplicative inverse in ℤ/2ⁿℤ. However, all odd values will.
public class AddativeIntegerGroupoid : IGroupoid<long> {
    public long Operation(long a, long b) { return a + b; }
}

public class AddativeIntegerSemigroup : AddativeIntegerGroupoid, ISemigroup<long> { }

public class AddativeIntegerMonoid : AddativeIntegerSemigroup, IMonoid<long> {
    public long Identity { get { return 0L; } }
}

public class AddativeIntegerGroup : AddativeIntegerMonoid, IGroup<long> {
    public long Inverse(long a) { return -a; }
}

public class AddativeIntegerAbelianGroup : AddativeIntegerGroup, IAbelianGroup<long> { }

public class MultiplicativeIntegerGroupoid : IGroupoid<long> {
    public long Operation(long a, long b) { return a * b; }
}

public class MultiplicativeIntegerSemigroup : MultiplicativeIntegerGroupoid, ISemigroup<long> { }

public class MultiplicativeIntegerMonoid : MultiplicativeIntegerSemigroup, IMonoid<long> {
    public long Identity { get { return 1L; } }
}

public class IntegerRingoid : IRingoid<long, AddativeIntegerGroupoid, MultiplicativeIntegerGroupoid> {
    public AddativeIntegerGroupoid Addition { get; private set; }
    public MultiplicativeIntegerGroupoid Multiplication { get; private set; }

    public IntegerRingoid() {
        Addition = new AddativeIntegerGroupoid();
        Multiplication = new MultiplicativeIntegerGroupoid();
    }

    public long Distribute(long a, long b) {
        return Multiplication.Operation(a, b);
    }
}

public class IntegerSemiring : IntegerRingoid, ISemiring<long, AddativeIntegerSemigroup, MultiplicativeIntegerSemigroup> {
    public new AddativeIntegerSemigroup Addition { get; private set; }
    public new MultiplicativeIntegerSemigroup Multiplication { get; private set; }

    public IntegerSemiring() : base() {
        Addition = new AddativeIntegerSemigroup();
        Multiplication = new MultiplicativeIntegerSemigroup();
    }
}

public class IntegerRing : IntegerSemiring, IRing<long, AddativeIntegerGroup, MultiplicativeIntegerSemigroup> {
    public new AddativeIntegerGroup Addition { get; private set; }

    public IntegerRing() : base() {
        Addition = new AddativeIntegerGroup();
    }
}

public class IntegerRingWithUnity : IntegerRing, IRingWithUnity<long, AddativeIntegerGroup, MultiplicativeIntegerMonoid> {
    public new MultiplicativeIntegerMonoid Multiplication { get; private set; }

    public IntegerRingWithUnity() : base() {
        Multiplication = new MultiplicativeIntegerMonoid();
    }
}
Floating-point Value Types and the Real Vector Space
C# offers three types that approximate the set of Reals: floats, doubles and decimals. Floats are the least representative, followed by doubles and decimals. These types are obviously not continuous, but the rounding error involved in most calculations is negligible relative to the calculations in question, so for most intents and purposes they can be treated as continuous.
As in the previous discussion on the integers, additive and multiplicative classes are defined over the algebraic structures defined in the Group and Ring Theory sections presented above. In addition to these implementations, an additional class is defined to describe a vector.
public class Vector<T> {
    private T[] vector;

    public int Dimension { get { return vector.Length; } }

    public T this[int n] {
        get { return vector[n]; }
        set { vector[n] = value; }
    }

    public Vector() {
        vector = new T[2];
    }
}
With these classes, it is now possible to implement the algebraic structures presented in the Module Theory section from above.
public class RealVectorModule : IModule<double, Vector<double>, RealRing, AddativeRealGroup, MultiplicativeRealSemigroup, VectorAbelianGroup<double>> {
    public RealRing Scalar { get; private set; }
    public VectorAbelianGroup<double> Vector { get; private set; }

    public RealVectorModule() {
        Scalar = new RealRing();
        Vector = new VectorAbelianGroup<double>(new AddativeRealAbelianGroup());
    }

    public Vector<double> Distribute(double t, Vector<double> r) {
        Vector<double> c = new Vector<double>();
        for (int i = 0; i < c.Dimension; i++)
            c[i] = Scalar.Multiplication.Operation(t, r[i]);
        return c;
    }
}

public class RealVectorUnitaryModule : RealVectorModule, IUnitaryModule<double, Vector<double>, RealRingWithUnity, AddativeRealGroup, MultiplicativeRealMonoid, VectorAbelianGroup<double>> {
    public new RealRingWithUnity Scalar { get; private set; }

    public RealVectorUnitaryModule() : base() {
        Scalar = new RealRingWithUnity();
    }
}

public class RealVectorVectorSpace : RealVectorUnitaryModule, IVectorSpace<double, Vector<double>, RealField, AddativeRealAbelianGroup, MultiplicativeRealAbelianGroup, VectorAbelianGroup<double>> {
    public new RealField Scalar { get; private set; }

    public RealVectorVectorSpace() : base() {
        Scalar = new RealField();
    }
}
Representation of Reference Types
The following are examples of how each theory presented above can leverage reference types found in the C# language to represent concepts drawn from computer science.
Strings, Computability and Monoids
Strings are the simplest of reference types in C#. From an algebraic structure point of view, the set of possible strings, Σ*, generated by an alphabet, Σ, and paired with a concatenation operation, ·, forms a monoid.
public class StringGroupoid : IGroupoid<string> {
    public string Operation(string a, string b) {
        return string.Format("{0}{1}", a, b);
    }
}

public class StringSemigroup : StringGroupoid, ISemigroup<string> { }

public class StringMonoid : StringSemigroup, IMonoid<string> {
    public string Identity { get { return string.Empty; } }
}
Monoids over strings have a variety of applications in the theory of computation. Syntactic Monoids describe the smallest set that recognizes a formal language. Trace Monoids describe concurrent programming by allowing different characters of an alphabet to represent different types of locks and synchronization points, while the remaining characters represent processes.
Classes, Interfaces, Type Theory and Semirings
Consider the set of types, T, comprising the primitive and constructed types in C#. The generating set of T is the set of primitive reference and value types, consisting of the types discussed thus far. New types can be defined by defining classes and interfaces.
A simple operation over T takes two types, A and B, and yields a third type, A + B, known as a sum type. In type theory, this means that an instance of A + B is either an instance of A or an instance of B. A second operation over T takes two types, A and B, and yields a third type, A × B, representing a tuple of the first two types, known as a product type. In other words, an instance of A × B is a pair consisting of an instance of A and an instance of B.
Both operations form semigroups, and in conjunction the two form a semiring.
Implementing this semiring is a little involved. The .NET library supports emitting dynamic type definitions at runtime. For sum types, this would lead to an inheritance view of the operation: types A and B would end up deriving from A + B, which means that any sequence of sums would yield an inheritance tree. A product type would result in composition of types with projection operations, π₁ and π₂, to access and assign the elements of the composite. Both type-operation implementations are outside the scope of this write-up and I’ll likely revisit this topic in a future write-up.
Delegates and Process Algebras
The third type of reference type to mention is the delegate type, which is C#’s approach to creating first-class functions. The simplest of delegates is the built-in Action delegate, which represents a single procedure taking no inputs and returning no value.
Given actions a and b, we can define a choice operator, +, where either a or b is executed, denoted a + b. The choice operation forms a commutative semigroup, since the operation is associative, (a + b) + c = a + (b + c), and commutative, a + b = b + a.
A product operation, ·, representing the sequential execution of a and then b, is given by a · b. The sequence operator forms a groupoid with unity, since the operation is not associative and there is an identity action, ε, representing a void operation, resulting in a · ε = ε · a = a.
Both operations together form a ringoid, since the sequence operation distributes over the choice operation: a · (b + c) = a · b + a · c. Meaning that “a takes place and then b or c takes place” is equivalent to “a and then b takes place, or a and then c takes place.”
public class SequenceGroupoidWithUnity : IGroupoid<Action> {
    public Action Identity {
        get { return () => { }; }
    }

    public Action Operation(Action a, Action b) {
        return () => { a(); b(); };
    }
}

public class ChoiceGroupoid : IGroupoid<Action> {
    public Action Operation(Action a, Action b) {
        if (DateTime.Now.Ticks % 2 == 0)
            return a;
        return b;
    }
}
The process algebra can be extended further to describe parallel computations with an additional operation. The operations given thus far enable one to derive the possible execution paths of a process, which makes it possible to comprehensively test each execution path and achieve complete test coverage.
Examples
The motivation of this work was to achieve C++’s approach to parametric polymorphism by utilizing C#’s subtype polymorphism to define the algebraic structure required by a method (akin to the built-in operations on types in C++). To illustrate how these interfaces are to be used, the following example extension methods operate over a collection of a given type and accept the minimal algebraic structure needed to complete the computation. The result is a single implementation of each calculation, as one would expect in C++.
static public class GroupExtensions {
    static public T Sum<T>(this IEnumerable<T> E, IMonoid<T> m) {
        return E.FoldL(m.Identity, m.Operation);
    }
}

static public class RingoidExtensions {
    static public T Count<T, R, A, M>(this IEnumerable<R> E, IRingWithUnity<T, A, M> r)
        where A : IGroup<T>
        where M : IMonoid<T> {
        return E
            .Map((x) => r.Multiplication.Identity)
            .Sum(r.Addition);
    }

    static public T Mean<T, A, M>(this IEnumerable<T> E, IDivisionRing<T, A, M> r)
        where A : IGroup<T>
        where M : IGroup<T> {
        return r.Multiplication.Operation(
            r.Multiplication.Inverse(E.Count(r)),
            E.Sum(r.Addition)
        );
    }

    static public T Variance<T, A, M>(this IEnumerable<T> E, IDivisionRing<T, A, M> r)
        where A : IGroup<T>
        where M : IGroup<T> {
        T average = E.Mean(r);
        return r.Multiplication.Operation(
            r.Multiplication.Inverse(E.Count(r)),
            E
                .Map((x) => r.Addition.Operation(x, r.Addition.Inverse(average)))
                .Map((x) => r.Multiplication.Operation(x, x))
                .Sum(r.Addition)
        );
    }
}

static public class ModuleExtensions {
    static public TV Mean<TS, TV, TSR, TSRA, TSRM, TVA>(this IEnumerable<TV> E, IVectorSpace<TS, TV, TSR, TSRA, TSRM, TVA> m)
        where TSR : IField<TS, TSRA, TSRM>
        where TSRA : IAbelianGroup<TS>
        where TSRM : IAbelianGroup<TS>
        where TVA : IAbelianGroup<TV> {
        return m.Distribute(
            m.Scalar.Multiplication.Inverse(E.Count(m.Scalar)),
            E.FoldL(m.Vector.Identity, m.Vector.Operation)
        );
    }
}
Conclusion
Abstract Algebra comes with a rich history and theory for dealing with different algebraic structures that are easily represented and used in the C# language to perform type-agnostic computations. Several examples drawn from mathematics and computer science illustrated how the solution can be used for both value and reference types in C# and leveraged in the context of a few example type-agnostic computations. The main benefit of this approach is that it minimizes the repetitious coding of computations required under the ad-hoc polymorphism approach adopted by the designers of the C# language. The downside is that several structures must be defined for the types being computed over, and working with C#’s parameter-constraint system can be unwieldy. While an interesting study, this solution would not be practical in a production setting under the current capabilities of the C# language.
References
Baeten, J.C.M. A Brief History of Process Algebra [pdf]. Department of Computer Science, Technische Universiteit Eindhoven. 31 Mar. 2012.
ECMA International. Standard ECMA-335 Common Language Infrastructure [pdf]. 2006.
Fokkink, Wan. Introduction to Process Algebra [pdf]. 2nd ed. Berlin: Springer-Verlag, 2007. 10 Apr. 2007. 31 Mar. 2012.
Goodman, Joshua. Semiring Parsing [pdf]. Computational Linguistics 25 (1999): 573-605. Microsoft Research. 31 Mar. 2012.
Hungerford, Thomas. Algebra. New York: Holt, Rinehart and Winston, 1974.
Ireland, Kenneth. A classical introduction to modern number theory. New York: SpringerVerlag, 1990.
Litvinov, G. I., V. P. Maslov, and A. Ya. Rodionov. Universal Algorithms, Mathematics of Semirings and Parallel Computations [pdf]. Springer Lecture Notes in Computational Science and Engineering. 7 May 2010. 31 Mar. 2012.
Mazurkiewicz, Antoni. Introduction to Trace Theory [pdf]. Rep. 19 Nov. 1996. Institute of Computer Science, Polish Academy of Sciences. 31 Mar. 2012.
Pierce, Benjamin. Types and programming languages. Cambridge, Mass: MIT Press, 2002.
Stroustrup, Bjarne. The C++ Programming Language. Reading, Mass: AddisonWesley, 1986.
Menger Sponge in C++ using OpenGL
This past summer I was going through some old projects and came across a Menger Sponge visualizer that I wrote back in college. A Menger Sponge is a simple fractal that has infinite surface area and encloses zero volume. The sponge is constructed in successive iterations, and the first four iterations are rendered in the video below.
The sponge starts as a single cube that is segmented into twenty-seven equally sized cubes. The center cube of each face, along with the center of the parent cube, is then discarded, and the process is applied again to each of the twenty remaining cubes. Visually, the process looks like the following:
The geometry of the process is straightforward. Starting with a cube’s origin, o, and edge length, ℓ, each of the children’s attributes can be calculated. Each child’s edge length is given by ℓ/3, and each child’s origin by o + (ℓ/3)d. The vector d represents a child’s position relative to its parent, with each component drawn from {-1, 0, 1} (e.g., d = (1, 0, -1)).
The following implementation isn’t particularly well written, but it accomplishes the desired end result. The point and Cube classes achieve the logic that I’ve outlined above. Cube can be thought of as a tree structure that is generated upon instantiation. The visualize() method pumps out the desired OpenGL commands to produce the faces of the cubes.
#include <GL/glut.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

//=================================================================================
class point {
public:
    point(GLfloat x, GLfloat y, GLfloat z, point* ref = NULL);
    GLfloat x, y, z;
};

point::point(GLfloat x, GLfloat y, GLfloat z, point* ref) {
    this->x = x;
    this->y = y;
    this->z = z;

    // Treat the coordinates as an offset from the reference point when given.
    if (ref != NULL) {
        this->x += ref->x;
        this->y += ref->y;
        this->z += ref->z;
    }
}

//=================================================================================
class Cube {
public:
    Cube(point* origin, GLfloat edgelength, GLfloat depth);
    ~Cube();
    void visualize();
private:
    void MakeFace(int i, int j, int k, int l);
    void ActAsContainer(point* o, GLfloat e, GLfloat d);
    void ActAsCube(point* o, GLfloat e);

    point** PointCloud;
    Cube** SubCubes;
};

Cube::Cube(point* origin, GLfloat edgelength, GLfloat depth) {
    if (depth <= 1.0)
        ActAsCube(origin, edgelength);
    else
        ActAsContainer(origin, edgelength, depth);
}

Cube::~Cube() {
    int i;
    if (PointCloud != NULL) {
        for (i = 0; i < 8; i++)
            delete PointCloud[i];
        delete[] PointCloud;
    }
    if (SubCubes != NULL) {
        for (i = 0; i < 20; i++)
            delete SubCubes[i];
        delete[] SubCubes;
    }
}

void Cube::ActAsCube(point* o, GLfloat e) {
    GLfloat ne = e / 2.0;

    // These are the actual physical cube's corner coordinates.
    PointCloud = new point*[8];
    SubCubes = NULL;

    PointCloud[0] = new point( ne,  ne,  ne, o); // net
    PointCloud[1] = new point( ne, -ne,  ne, o); // set
    PointCloud[2] = new point(-ne,  ne,  ne, o); // nwt
    PointCloud[3] = new point(-ne, -ne,  ne, o); // swt
    PointCloud[4] = new point( ne,  ne, -ne, o); // neb
    PointCloud[5] = new point( ne, -ne, -ne, o); // seb
    PointCloud[6] = new point(-ne,  ne, -ne, o); // nwb
    PointCloud[7] = new point(-ne, -ne, -ne, o); // swb
}

void Cube::ActAsContainer(point* o, GLfloat e, GLfloat d) {
    GLfloat ne = e / 3.0;

    // These are the centers of each sub cube structure.
    SubCubes = new Cube*[20];
    PointCloud = NULL;

    // Top layer: the eight cubes ringing the discarded face center.
    SubCubes[0]  = new Cube(new point(-ne,  ne,  ne, o), ne, d - 1.0);
    SubCubes[1]  = new Cube(new point(0.0,  ne,  ne, o), ne, d - 1.0);
    SubCubes[2]  = new Cube(new point( ne,  ne,  ne, o), ne, d - 1.0);
    SubCubes[3]  = new Cube(new point( ne, 0.0,  ne, o), ne, d - 1.0);
    SubCubes[4]  = new Cube(new point( ne, -ne,  ne, o), ne, d - 1.0);
    SubCubes[5]  = new Cube(new point(0.0, -ne,  ne, o), ne, d - 1.0);
    SubCubes[6]  = new Cube(new point(-ne, -ne,  ne, o), ne, d - 1.0);
    SubCubes[7]  = new Cube(new point(-ne, 0.0,  ne, o), ne, d - 1.0);

    // Middle layer: only the four corner cubes remain.
    SubCubes[8]  = new Cube(new point( ne,  ne, 0.0, o), ne, d - 1.0);
    SubCubes[9]  = new Cube(new point( ne, -ne, 0.0, o), ne, d - 1.0);
    SubCubes[10] = new Cube(new point(-ne,  ne, 0.0, o), ne, d - 1.0);
    SubCubes[11] = new Cube(new point(-ne, -ne, 0.0, o), ne, d - 1.0);

    // Bottom layer: mirrors the top layer.
    SubCubes[12] = new Cube(new point(-ne,  ne, -ne, o), ne, d - 1.0);
    SubCubes[13] = new Cube(new point(0.0,  ne, -ne, o), ne, d - 1.0);
    SubCubes[14] = new Cube(new point( ne,  ne, -ne, o), ne, d - 1.0);
    SubCubes[15] = new Cube(new point( ne, 0.0, -ne, o), ne, d - 1.0);
    SubCubes[16] = new Cube(new point( ne, -ne, -ne, o), ne, d - 1.0);
    SubCubes[17] = new Cube(new point(0.0, -ne, -ne, o), ne, d - 1.0);
    SubCubes[18] = new Cube(new point(-ne, -ne, -ne, o), ne, d - 1.0);
    SubCubes[19] = new Cube(new point(-ne, 0.0, -ne, o), ne, d - 1.0);
}

void Cube::MakeFace(int i, int j, int k, int l) {
    glVertex3f(PointCloud[i]->x, PointCloud[i]->y, PointCloud[i]->z);
    glVertex3f(PointCloud[j]->x, PointCloud[j]->y, PointCloud[j]->z);
    glVertex3f(PointCloud[k]->x, PointCloud[k]->y, PointCloud[k]->z);
    glVertex3f(PointCloud[l]->x, PointCloud[l]->y, PointCloud[l]->z);
}

void Cube::visualize() {
    int i;

    if (PointCloud != NULL) {
        glBegin(GL_QUADS);
        glColor3f(1.0, 0.0, 0.0); // top
        MakeFace(0, 2, 3, 1);
        glColor3f(0.0, 1.0, 1.0); // bottom
        MakeFace(4, 6, 7, 5);
        glColor3f(0.0, 1.0, 0.0); // north
        MakeFace(0, 2, 6, 4);
        glColor3f(1.0, 0.0, 1.0); // south
        MakeFace(1, 3, 7, 5);
        glColor3f(0.0, 0.0, 1.0); // east
        MakeFace(0, 4, 5, 1);
        glColor3f(1.0, 1.0, 0.0); // west
        MakeFace(2, 6, 7, 3);
        glEnd();
    }

    if (SubCubes != NULL) {
        for (i = 0; i < 20; i++)
            SubCubes[i]->visualize();
    }
}
The implementation of the program is your run-of-the-mill OpenGL boilerplate. The application takes in an argument dictating what order of sponge it should produce. It sets up the camera and positions the sponge at the origin. The sponge is left stationary, while the camera is made to orbit upon each display(). On idle(), a redisplay message is sent back to the OpenGL system in order to achieve the effect that the sponge is spinning.
//=================================================================================
Cube* MengerCube;

void idle() {
    glutPostRedisplay();
}

void display() {
    static GLfloat rtri = 0.0;

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    gluLookAt(1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0);
    glRotatef((rtri += 0.932), 1.0, 0.5, 1.0);

    MengerCube->visualize();
    glutSwapBuffers();
}

void reshape(int w, int h) {
    glViewport(0, 0, w, h);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-8.0, 8.0, -8.0, 8.0, -8.0, 8.0);
}

void init() {
    glShadeModel(GL_SMOOTH);
    glClearColor(0.0, 0.0, 0.0, 0.0);
    glClearDepth(1.0f);
    glEnable(GL_DEPTH_TEST);
    glColor3f(1.0, 1.0, 1.0);
}

GLfloat getDepth(char* depth) {
    int k = atoi(depth);
    if (k <= 1)
        return 1.0;
    else if (k >= 5)
        return 5.0;
    else
        return (GLfloat) k;
}

int main(int argc, char* argv[]) {
    GLfloat depth;
    point origin(0.0, 0.0, 0.0);

    switch (argc) {
    case 2:
        depth = getDepth(argv[1]);
        break;
    default:
        depth = 2.0;
        break;
    }

    MengerCube = new Cube(&origin, 8.0, depth);

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glEnable(GL_DEPTH_TEST);
    glutInitWindowSize(500, 500);
    glutInitWindowPosition(0, 0);
    glutCreateWindow(*argv);
    glutReshapeFunc(reshape);
    glutDisplayFunc(display);
    glutIdleFunc(idle);
    init();
    glutMainLoop();

    delete MengerCube;
}