Archive for the ‘Computer Science’ Category
GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA
C, CUDA, and Python source code available on GitHub
Introduction
Gaussian Mixture Models [1, 435-439] offer a simple way to capture complex densities by employing a linear combination of multivariate normal distributions, each with their own mean, covariance, and mixture coefficient,
, s.t.
.
Of practical interest is the learning of the number of components and the values of the parameters. Evaluation criteria, such as Akaike and Bayesian, can be used to identify the number of components, or non-parametric models like Dirichlet processes can be used to avoid the matter all together. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU for varying numbers of components, points, and point dimensions.
Multivariate Normal Distribution
The multivariate normal distribution With mean, , and symmetric, positive definite covariance,
, is given by:
From a computational perspective, we will be interested in evaluating the density for values. Thus, a naive implementation would be bounded by
due to the matrix determinate in the normalization term. We can improve upon this by computing the Cholesky factorization,
, where
is a lower triangular matrix [6, 157-158]. The factorization requires
time and computing the determinate becomes
by taking advantage of the fact that
. Further, we can precompute the factorization and normalization factor for a given parameterization which leaves us with complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires one perform two vector matrix operations and find the inverse of the covariance matrix with worst case behavior
. Leveraging the Cholesky factorization, we’ll end up solving a series of triangular systems by forward and backward substitution in
and completing an inner product in
as given by
,
, and
. Thus, our pre-initialization time is
and density determination given by
. Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic,
, and non-isotropic,
, configurations. For robustness, we’ll stick with the full covariance.
To avoid numerical issues such as overflow and underflow, we’re going to consider throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criteria satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a subspace of
.
Expectation Maximization
From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by . Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updating those parameters by maximizing the log-likelihood of the data:
Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:
Where the term is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of
and
, and account for it explicitly in the update of
):
The new parameters are resolved within the maximization step:
The algorithm continues back and forth between expectation and maximization stages until the change in log likelihood is less than some epsilon, or a maximum number of user specified iterations has elapsed.
Implementations
Sequential Per iteration complexity given by . We expect
because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by
.
Parallel There are two natural data parallelisms that appear in the algorithm. The calculation of the and
across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. Resulting complexity is given by
for work coordinated across
processors.
Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of we compute each point in parallel by forming a grid of one dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the loglikelihood and
is done by computing and storing
, then updating the storage for
, and then performing a parallel reduction over
to produce the loglikelihood. Parallel reductions are a core tasks are implemented by first standardizing the input array of points to an supremum power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the loglikelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute
,
is copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component to yield
. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances is computed with a single kernel invocation parallel in the number of components.
The main design consideration was whether or not use streams. For larger numbers of components, this will result in improved runtime performance, however, it comes at the cost of increased memory usage which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increase in memory was favorable to using less memory and executing each component sequentially.
To optimize the runtime of the implementation nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, this allowed the implementation to execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events for parallelization of the component probability densities and parameter updates in the maximization step. In doing so, this allowed for a fold reduction since the components calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory minimizing the number of global memory writes instead of performing iterated reductions against sequential addressing that preformed global memory reads and writes for each point. The final optimization step was to used pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.
Evaluation
To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use The Box-Muller transform, Zigguart method, or Ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can by done by sampling a standard normal vector and computing
where
can be computed by Eigendecomposition,
, or Cholesky factorization,
. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient and then sample the underlying
distribution, and perform this process for the desired number of points.
The matter of generating GMMs it more interesting. Here we draw for
, alternatively, one could draw
. Means are drawn by
with
so that means are relatively spread out in
. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution,
for
, comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of
vectors. Based on a
method by [8], [9] gives an equally efficient method for sampling
by letting
and
for
and
.
To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured by varying each of the ,
, and
parameters while holding the other two fixed. From an end user perspective wall clock time is preferable to the time the operating system actually devoted to the problem since wall clock time is more valuable. There will be variability in the results since each instance requires a different number of iterations for the log likelihood to converge. Tests were conducted on a Xeon 1245 v5 3.5 Ghz system with 32GB of memory and an NVIDIA GTX 1060 6GB graphics card with 1280 cores.
Since the parameter space is relatively large Figures 2-5 look at varying one parameter will fixing the others to demonstrate the relative merits of each approach. When the number of points dominates the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the num of components is high giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable delivering superior runtime performance without sacrificing quality.
It is important to note that the results obtained from the CUDA solution may differ to those the sequential and parallel approaches. This is due to nondeterministic round off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably, the presence of fused multiple add on NVIDIA GPUs which are more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:
Conclusion
This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instance, and focus on providing higher level language bindings so that it can be better utilized in traditional data science toolchains.
References
- Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
- Collange, S., Defour, D., Graillat, S., and Lakymhuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83-97.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the eme algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1-38.
- Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, (2016) C1-C32.
- Harris, M. Optimizing cuda. SC07: High Performance Computing with CUDA (2007).
- Kincaid, D., and Cheney, W. Numerical analysis: mathematics of scientific computing. 3 ed. Brooks/Cole, 2002.
- Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257-260.
- Odell, P., and Feiveson, A. A Numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199-203.
- Sawyer, S. Wishart distributions and inverse-wishart sampling. URL: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf (2007).
- Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and ieee 754 compliance for nvidia gpus. rn(A + B) 21., 1 (2011), 18749-19424.
A Greedy Approximation Algorithm for the Linear Assignment Problem
Starting today, I will be posting some of the related source code for articles on GitHub.
Introduction
The Linear Assignment Problem (LAP) is concerned with uniquely matching an equal number of workers to tasks, , such that the overall cost of the pairings is minimized. A polynomial time algorithm was developed in the late fifties by [6], and further refined by [9], called the Hungarian method. Named so after the work of Hungarian mathematicians König and Egerváry whose theorems in the 1930s form the basis for the method. While the Hungarian Method can solve LAP instances in
time, we wish to find faster algorithms even if it means sacrificing optimality in the process. Here we examine a greedy
-approximation algorithm with
runtime in terms of its approximation factor and compare it empirically to the Hungarian method.
Linear Assignment Problem
The above linear program has cost, , and assignment,
, matrices that specify the terms of the LAP. This is equivalent to finding a perfect matching in a weighted bipartite graph. A minimal cost may have several possible assignments, but we are only interested in finding just one. It is assumed that no one worker can do all jobs more efficiently by themselves than the distributing work across all workers. Likewise, if the costs are thought of as durations, then the minimum cost is the minimum sequential rather than parallel time taken to complete the tasks.
From a practical point of view, we may relax the integral constraint on and allow all positive real-valued costs. For instances where there are more jobs than workers, and vice versa, dummy entries valued greater than the existing maximum may be added. Minimizing the cost is the default objective, but the maximum cost can be found by finding the optimal assignment for
, then finding the cost relative to
.
Algorithms
Brute Force Rather than using the mathematical programming or graph theoretic representation of the problem, we can instead view the problem as finding the assignment that minimizes the cost out of all possible assignments:
There are such assignments which can be produced using an iterative version of Heap’s algorithm [5] in
time assuming one does differential scoring (opposed to calculating the score for each permutation which would result in an
algorithm.)
Random The random algorithm selects a permutation uniformly from the set of all possible assignment permutations in
time using the Fisher-Yates shuffle [4]. This obviously does not produce an optimal or near-optimal solution, but serves as a straw man to compare other results.
Greedy The greedy heuristic continues to cover the row and column of the smallest uncovered entry in the cost matrix until all entries are covered. The resulting set of entries then constitutes the assignment of workers to jobs. An inefficient algorithm can be used to find the smallest entry every iteration, or a more efficient result of
can be obtained through the use of a sorted, array indexed hybrid mesh and queue. Let
represent a tuple consisting of row, column, and value; the previous entry in the matrix
this value, and the next entry in this matrix
this value; and the
(left, above, right, below) that are adjacent to this node.
Algorithm 1 A greedy algorithm for the LAP.
// Adjacent node left, above, right, below properties
// Sort in ascending order by node value
// Adjacent node previous and next properties
// Deletes row and col of
Allocating and linking for assignment is ; mesh
; queue
. Therefore, initialization requires
time. The body of the loop requires a constant time assignment of worker to job, and
time to remove the row and column from a
matrix using a modified depth first search. Thus, the loop itself accounts for
time. The resulting time complexity is therefore
.
Breaking ties for the minimum uncovered value can result in different costs. This drawback is shown in the above example were choosing at
yields a minimum cost of
, where as the one at
gives a minimum cost of
. The next progression in the design of the greedy algorithm would be to try all minimum positions and keep the top
performing paths.
Hungarian The general idea behind the Kuhn-Munkres algorithm is that if we are given an initial assignment, we can make further assignments and potentially reassign workers until all workers have been tasked with a job. The high-level sketch of the algorithm starts with an initial assignment. While we have jobs that are unassigned, we look for qualified workers, ie, the zero entries. If a worker is already assigned to a job, but is also qualified for another, then we prime the alternative and continue to the next qualified worker, but if that is the only job the worker is qualified for, then we’d like to reassign any other worker already tasked to that job. This leads to a natural ripple effect represented by an alternating path of starred and primed entries. In Munkres’ paper [9] “starred” zero’s represent assignments of workers to jobs, and “primed” zero’s are alternative assignments. By flipping the bits of the path, we reassign workers to their alternative tasks while ensuring the assignment continues to be minimal by construction. After assigning as many workers as we have to, we then deduct the lowest cost to create a new qualified worker. Thus, every iteration we are guaranteed to make positive progress towards our goal of finding an optimal assignment. This scheme results in the worst case time to complete.
Algorithm 2 The Hungarian method for the LAP.
- Star the first uncovered zero in row
, cover the corresponding column
for
All columns not covered
Uncovered zeros
- Prime the current uncovered zero
There’s a starred zero in this row
- Uncover the starred zero’s column and cover the row
-
- Find an alternating augmented path from the primed zero
- Unstar the starred zeros on the path and star the primed zeros on the path
- Remove all the prime markings and cover all stared zeros
Found path
-
over all uncovered
for all uncovered columns
for all covered rows
Starred zeros // These are all the assignments
To further illustrate the algorithm, consider the following example where starred entries are denoted by red, and primed entries by green:
Analysis
The prevailing convention in the literature is to look at the approximation factor, , to determine how close the results of an approximation algorithm are to optimal [10]. Here this ratio is the expected minimum cost assignment of the algorithm under test to the same quantity given by the expected minimum assignment cost. Let
be an
a standard exponential random cost matrix. We resort to the exponential distribution for its ease of analyis and prominence in related literature. Cf. the works of [7], [8] for analysis based on
.
Exponential Distribution Properties Let have cumulative distribution function
and expectation
. The distribution demonstrates the memoryless property for expectations
. Define the order statistic
to be the minimum of
draws from
.
[2] with expectation
. If
then
with expectation
.
Expected Minimum Cost The expected minimum assignment cost for is given by [1]:
Which is the generalized harmonic number of order two and converges to . For the generalized harmonic numbers,
,
for
.
Greedy The minimum value of an matrix is given by the order statistic
with expectation
. The expected value of the minimum cost assignment is not just
because the expectation doesn’t take into account the previous iteration’s minimum value. To accomplish this we make use of the memoryless property of the exponential distribution to observe that the expected difference in minimums between iterations is the expected minimum value given by
. If we add up all these differences we get the expected minimum value of the k’th iteration; summing all these expectations then yields the expected minimum cost assignment:
This is the harmonic number of order one which does not converge. The resulting approximation factor is:
Random The random algorithm will simply select an assignment permutation, so we are just adding up
distributed random variables leading to an expected cost of:
And approximation factor:
From this analysis one concludes that the greedy algorithm has an unbounded approximation factor that grows significantly slower than that of randomly selecting assignments.
Evaluation
To illustrate the preceding results, Figure 1 shows the approximation factor for the greedy algorithm implementations against the derived approximation factor. The simulated results are based on 120 standard exponentially distributed matrices for
. Using the same conventions for the approximation factor, Figure 2 illustrates the runtime characteristics of the algorithms after rejecting outliers due to system fluctuations. Results obtained from source code compiled with -O3 flags and ran on a Xeon E3-1245 v5 3.5 Ghz system with 32 GBs of 2133Mhz DDR4 RAM. The algorithms coincide with the theoretical time complexities as shown in Table 2.
Solver | MSE |
---|---|
GREEDY-EFFICIENT | 0.002139 |
GREEDY-NAIVE | 0.014161 |
HUNGARIAN | 0.232998 |
Summary
Brute | Random | Greedy | Hungarian | |
---|---|---|---|---|
Complexity | ||||
1 | 1 |
Exact solutions can be delivered by the brute method when a handful of workers are being considered, and the Hungarian method should be considered for all other instances. Approximate solutions can be provided by the greedy algorithm with logarithmic degeneracy while providing a linear factor improvement over the Hungarian method. For inputs greater than those considered, the parallel Auction algorithm [3] is a suitable alternative and the subject of future work.
References
- Aldous, D. J. The
limit in the random assignment problem. Random Structures & Algorithms 18, 4 (2001), 381-418.
- Balakrishnan, N., and Rao, C. Handbook of statistics 16: Order statistics-theory and methods, 2000.
- Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of operation research 4, 1 (1988), 105-123.
- Durtenfeld, R. Algorithm 235: random permutation. Communications of the ACM 7, 7 (1964), 420.
- Heap, B. Permutations by interchanges. The Computer Journal 6, 3 (1963), 293-298.
- Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly 2, 1-2 (1955), 83097.
- Kurtzberg, J. M. On approximation methods for the assignment problem. Journal of the ACM (JACM) 9, 4 (1962), 419-439.
- Steele, M. J. Probability and statistics in the service of computer science: illustrations using the assignment problem. Communications in Statistics-Theory and Methods 19, 11 (1990), 4315-4329.
- Munkres, J. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 32-38.
- Williamson, D. P., and Shmoys, D. B. The design of approximation algorithms. Cambridge university press, 2011.
k-Means Clustering using CUDAfy.NET
Introduction
I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised k-means clustering algorithm using CUDAfy.NET– a C# wrapper for doing parallel computation on CUDA-enabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikit-learn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.
Algorithms
Sequential
The traditional algorithm attributed to [Stu82] begins as follows:
- Pick
points at random as the starting centroid of each cluster.
- do (until convergence)
- For each point in data set:
- labels[point] = Assign(point, centroids)
- centroids = Aggregate(points, labels)
- convergence = DetermineConvergence()
- For each point in data set:
- return centroids
Assign
labels each point with the label of the nearest centroid, and Aggregate
updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign
routine. For each of the points we’ll compute the distance to each of the
centroids and pick the centroid with the shortest distance that we’ll assign to the point. This is an example of the Nearest Neighbor Search problem. Linear search gives
which is preferable to using something like k-d trees which requires repeated superlinear construction and querying. Assuming Euclidean distance and points from
, this gives time complexity
. The
Aggregate
routine will take . Assuming convergence is guaranteed in
iterations then the resulting complexity is
which lends to an effectively linear algorithm.
Parallel
[LiFa89] was among the first to study several different shared memory parallel algorithms for k-means clustering, and here I will be going with the following one:
- Pick
points at random as the starting centroid of each cluster.
- Partition
points into
equally sized sets
- Run to completion threadId from 1 to
as:
- do (until convergence)
- sum, count = zero(
), zero(
)
- For each point in partition[threadId]:
- label = Assign(point, centroids)
- For each dim in point:
- sum[
* label + dim] += point[dim]
- sum[
- count[label] = count[label] + 1
- if(barrier.Synchronize())
- centroids = sum / count
- convergence = DetermineConvergence()
- sum, count = zero(
- do (until convergence)
- return centroids
The parallel algorithm can be viewed as smaller instances of the sequential algorithm processing
chunks of points in parallel. There are two main departures from the sequential approach 1) future centroid positions are accumulated and counted after each labeling and 2) each iteration of
while loops are synchronized before continuing on to the next iteration using a barrier – a way of ensuring all threads wait for the last thread to arrive, then continue to wait as the last one enters the barrier, and exits allowing the other threads to exit.
In terms of time complexity, Assign
remains unchanged at , and incrementing the sums and counts for the point’s label takes time
. Thus for
points, a single iteration of the loop gives
time. Given
threads, the maximum time would be given by the thread that enters the barrier, and assuming at most
iterations, then the overall complexity is
. Which suggests we should see at most a
speedup over the sequential implementation for large values of
.
GPGPU
The earliest work I found on doing k-means clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:
- Pick
points at random as the starting centroid of each cluster.
- Partition
into
blocks such that each block contains no more than
points
- do (until convergence)
- Initialize sums, counts to zero
- Process blockId 1 to
,
at a time in parallel on the GPGPU:
- If threadId == 0
- Initialize blockSum, blockCounts to zero
- Synchronize Threads
- label = Assign(points[blockId *
+ threadId], centroids)
- For each dim in points[blockId *
+ threadId]:
- atomic blockSum[label * pointDim + dim] += points[blockId *
+ threadId]
- atomic blockSum[label * pointDim + dim] += points[blockId *
- atomic blockCount[label] += 1
- Synchronize Threads
- If threadId == 0
- atomic sums += blockSum
- atomic counts += blockCounts
- If threadId == 0
- centroids = sums / counts
- convergence = DetermineConvergence()
The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that process a single “block” at a time. Here we assign no more than points to a block such that each point runs as a single thread to be executed on each of the CUDA cores of the Streaming Multiprocessor.
When a single block is executing we’ll initialize the running sum and count as we did in the parallel case, then request that the threads running synchronize, then proceed to calculate the label of the point assigned to the thread atomically update the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically copy those block level sum and counts over to the global sum and counts shared by all of the blocks.
Let’s figure out the time complexity. A single thread in a block being executed by a Streaming Multiprocessor takes time assuming that all
threads of the block execute in parallel, that there are
blocks, and
Streaming Multiprocessors, then the complexity becomes:
. Since
, and at most
iterations can go by in parallel, we are left with
. So the expected speedup over the sequential algorithm should be
.
Expected performance
For large values of , if we allow
to be significantly larger than
, we should expect the parallel version to 8x faster than the sequential version and the GPGPU version to be 255x faster than the sequential version given that
for the given set of hardware that will be used to conduct tests. For
to be significantly larger than
, then parallel is the same, and GPGPU version should be 340x faster than the sequential version. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching issues that are not incorporated (and difficult to incorporate) into the calculations.
Implementations
I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikit-learn was configured for testing.
C++
The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high level orchestration code. Let’s begin with the barrier implementation. Since I’m running on Windows 7, I’m unable to use the convenient InitializeSynchronizationBarrier
, EnterSynchronizationBarrier
, and DeleteSynchronizationBarrier
API calls beginning with Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:
// ---------------------------------------------------------------------------- // Synchronization utility functions // ---------------------------------------------------------------------------- struct Barrier { CONDITION_VARIABLE conditionVariable; CRITICAL_SECTION criticalSection; int atBarrier; int expectedAtBarrier; }; void deleteBarrier(Barrier* barrier) { DeleteCriticalSection(&(barrier->criticalSection)); // No API for delete condition variable } void initializeBarrier(Barrier* barrier, int numThreads) { barrier->atBarrier = 0; barrier->expectedAtBarrier = numThreads; InitializeConditionVariable(&(barrier->conditionVariable)); InitializeCriticalSection(&(barrier->criticalSection)); } bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) { bool lastToEnter = false; EnterCriticalSection(&(barrier->criticalSection)); ++(barrier->atBarrier); if (barrier->atBarrier == barrier->expectedAtBarrier) { barrier->atBarrier = 0; lastToEnter = true; func(data); WakeAllConditionVariable(&(barrier->conditionVariable)); } else { SleepConditionVariableCS(&(barrier->conditionVariable), &(barrier->criticalSection), INFINITE); } LeaveCriticalSection(&(barrier->criticalSection)); return lastToEnter; }
A Barrier
struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.
When a thread arrives at the barrier (synchronizeBarrier
) it requests the critical section before attempting to increment the atBarrier
variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is invoked. The reason why LeaveCriticalSection
is included outside the the if statement is because SleepConditionVariableCS
will release the critical section before putting the thread to sleep, then reacquire the critical section when it awakes. I don’t like that behavior since its an unnecessary acquisition of the critical section and slows down the implementation.
There is a single allocation routine which performs a couple different rounds of error checking when calling calloc
; first to check if the routine returned null, and second to see if it set a Windows error code that I could inspect from GetLastError
. If either event is true, the application will terminate.
// ---------------------------------------------------------------------------- // Allocation utility functions // ---------------------------------------------------------------------------- void* checkedCalloc(size_t count, size_t size) { SetLastError(NO_ERROR); void* result = calloc(count, size); DWORD lastError = GetLastError(); if (result == NULL) { fprintf(stdout, "Failed to allocate %d bytes. GetLastError() = %d.", size, lastError); ExitProcess(EXIT_FAILURE); } if (result != NULL && lastError != NO_ERROR) { fprintf(stdout, "Allocated %d bytes. GetLastError() = %d.", size, lastError); ExitProcess(EXIT_FAILURE); } return result; }
Now on to the core of the implementation. A series of structs are specified for those data that are shared (e.g., points, centroids, etc) among the threads, and those that are local to each thread (e.g., point boundaries, partial results).
// ---------------------------------------------------------------------------- // Parallel Implementation // ---------------------------------------------------------------------------- struct LocalAssignData; struct SharedAssignData { Barrier barrier; bool continueLoop; int numPoints; int pointDim; int K; double* points; double* centroids; int* labels; int maxIter; double change; double pChange; DWORD numProcessors; DWORD numThreads; LocalAssignData* local; }; struct LocalAssignData { SharedAssignData* shared; int begin; int end; int* labelCount; double* partialCentroids; };
The assign
method does exactly what was specified in the parallel algorithm section. It will iterate over the portion of points it is responsible for, compute their labels and its partial centroids (sum of points with label , division done at aggregate step.).
void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) { int* local = (int*)checkedCalloc(end - begin, sizeof(int)); int* localCount = (int*)checkedCalloc(K, sizeof(int)); double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double)); // Process a chunk of the array. for (int point = begin; point < end; ++point) { double optDist = INFINITY; int optCentroid = -1; for (int centroid = 0; centroid < K; ++centroid) { double dist = 0.0; for (int dim = 0; dim < pointDim; ++dim) { double d = points[point * pointDim + dim] - centroids[centroid * pointDim + dim]; dist += d * d; } if (dist < optDist) { optDist = dist; optCentroid = centroid; } } local[point - begin] = optCentroid; ++localCount[optCentroid]; for (int dim = 0; dim < pointDim; ++dim) localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim]; } memcpy(&label[begin], local, sizeof(int) * (end - begin)); free(local); memcpy(labelCount, localCount, sizeof(int) * K); free(localCount); memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K); free(localPartial); }
One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using within the assign
routine. This in particular was motivated after I read about false sharing where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU causing overall performance to degrade. For labelCount
and partialCentroids
they’re reallocated since I was concerned about data locality and wanted the three arrays to be relatively in the same neighborhood of memory. Speaking of which, memory coalescing is used for the points array so that point dimensions are adjacent in memory to take advantage of caching. Overall, a series of cache friendly optimizations.
The aggregate
routine follows similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args->shared->local[t].partialCentroids
and args->shared->local[t].labelCount[t]
. Using these partial results all the routine to complete in time which assuming all of these parameters are significantly less than
, gives a constant time routine. Once the centroids have been updated, the change in their location is computed and used to determine convergence along with how many iterations have gone by. Here if more than 1,000 iterations have occurred or the relative change in position is less than some tolerance (0.1%) then the threads will terminate.
void aggregate(void * data) { LocalAssignData* args = (LocalAssignData*)data; int* assignmentCounts = (int*)checkedCalloc(args->shared->K, sizeof(int)); double* newCentroids = (double*)checkedCalloc(args->shared->K * args->shared->pointDim, sizeof(double)); // Compute the assignment counts from the work the threads did. for (int t = 0; t < args->shared->numThreads; ++t) for (int k = 0; k < args->shared->K; ++k) assignmentCounts[k] += args->shared->local[t].labelCount[k]; // Compute the location of the new centroids based on the work that the // threads did. for (int t = 0; t < args->shared->numThreads; ++t) for (int k = 0; k < args->shared->K; ++k) for (int dim = 0; dim < args->shared->pointDim; ++dim) newCentroids[k * args->shared->pointDim + dim] += args->shared->local[t].partialCentroids[k * args->shared->pointDim + dim]; for (int k = 0; k < args->shared->K; ++k) for (int dim = 0; dim < args->shared->pointDim; ++dim) newCentroids[k * args->shared->pointDim + dim] /= assignmentCounts[k]; // See by how much did the position of the centroids changed. args->shared->change = 0.0; for (int k = 0; k < args->shared->K; ++k) for (int dim = 0; dim < args->shared->pointDim; ++dim) { double d = args->shared->centroids[k * args->shared->pointDim + dim] - newCentroids[k * args->shared->pointDim + dim]; args->shared->change += d * d; } // Store the new centroid locations into the centroid output. memcpy(args->shared->centroids, newCentroids, sizeof(double) * args->shared->pointDim * args->shared->K); // Decide if the loop should continue or terminate. (max iterations // exceeded, or relative change not exceeded.) args->shared->continueLoop = args->shared->change > 0.001 * args->shared->pChange && --(args->shared->maxIter) > 0; args->shared->pChange = args->shared->change; free(assignmentCounts); free(newCentroids); }
Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.
DWORD WINAPI assignThread(LPVOID data) { LocalAssignData* args = (LocalAssignData*)data; while (args->shared->continueLoop) { memset(args->labelCount, 0, sizeof(int) * args->shared->K); // Assign points cluster labels assign(args->shared->labels, args->begin, args->end, args->labelCount, args->shared->K, args->shared->points, args->shared->pointDim, args->shared->centroids, args->partialCentroids); // Tell the last thread to enter here to aggreagate the data within a // critical section synchronizeBarrier(&(args->shared->barrier), aggregate, args); }; return 0; }
The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance will the remaining instances will run on separate threads. The data is partitioned, then the threads are spawned using the
CreateThread
routine. I wish there was a Windows API that would allow me to simultaneously create threads with a specified array of arguments because CreateThread will automatically start the thread as soon as it’s created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.
void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) { // Lookup and calculate all the threading related values. SYSTEM_INFO systemInfo; GetSystemInfo(&systemInfo); DWORD numProcessors = systemInfo.dwNumberOfProcessors; DWORD numThreads = numProcessors - 1; DWORD pointsPerProcessor = numPoints / numProcessors; // Prepare the shared arguments that will get passed to each thread. SharedAssignData shared; shared.numPoints = numPoints; shared.pointDim = pointDim; shared.K = K; shared.points = points; shared.continueLoop = true; shared.maxIter = 1000; shared.pChange = 0.0; shared.change = 0.0; shared.numThreads = numThreads; shared.numProcessors = numProcessors; initializeBarrier(&(shared.barrier), numProcessors); shared.centroids = centroids; for (int i = 0; i < K; ++i) { int point = rand() % numPoints; for (int dim = 0; dim < pointDim; ++dim) shared.centroids[i * pointDim + dim] = points[point * pointDim + dim]; } shared.labels = (int*)checkedCalloc(numPoints, sizeof(int)); // Create thread workload descriptors LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData)); for (int i = 0; i < numProcessors; ++i) { local[i].shared = &shared; local[i].begin = i * pointsPerProcessor; local[i].end = min((i + 1) * pointsPerProcessor, numPoints); local[i].labelCount = (int*)checkedCalloc(K, sizeof(int)); local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double)); } shared.local = local; // Kick off the threads HANDLE* threads = (HANDLE*)checkedCalloc(numThreads, sizeof(HANDLE)); for (int i = 0; i < numThreads; ++i) threads[i] = CreateThread(0, 0, assignThread, &local[i + 1], 0, NULL); // Do work on this thread so that it's just not sitting here idle while the // other threads are doing work. assignThread(&local[0]); // Clean up WaitForMultipleObjects(numThreads, threads, true, INFINITE); for (int i = 0; i < numThreads; ++i) CloseHandle(threads[i]); free(threads); for (int i = 0; i < numProcessors; ++i) { free(local[i].labelCount); free(local[i].partialCentroids); } free(local); free(shared.labels); deleteBarrier(&(shared.barrier)); }
C#
The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.
In the GPGPU paradigm there is a host and a device in which sequential operations take place on the host (ie. managed C# code) and parallel operations on the device (ie. CUDA code). To delineate between the two, the [Cudafy]
method attribute is used on the static public method assign
. The set of host operations are all within the Fit
routine.
Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and motivated by Thomas Bradley’s Advanced CUDA Optimization workshop notes [pdf] that suggest at that regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) having shared memory that the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.
What is nice about the shared memory is that it is much faster than the global memory of the GPGPU. (cf. Using Shared Memory in CUDA C/C++) To make use of this fact the threads will rely on two arrays in shared memory: sum of the points and the count of those belonging to each centroid. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will then add the arrays in shared memory to those in the global memory using the appropriate atomic operations.
using Cudafy; using Cudafy.Host; using Cudafy.Translator; using Cudafy.Atomics; using System; namespace CUDAfyTesting { public class CUDAfyKMeans { [Cudafy] public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) { // Unpack the const value array int pointDim = constValues[0]; int K = constValues[1]; int numPoints = constValues[2]; // Ensure that the point is within the boundaries of the points // array. int tId = thread.threadIdx.x; int point = thread.blockIdx.x * thread.blockDim.x + tId; if (point >= numPoints) return; // Use two shared arrays since they are much faster than global // memory. The shared arrays will be scoped to the block that this // thread belongs to. // Accumulate the each point's dimension assigned to the k'th // centroid. When K = 128 => pointDim = 2; when pointDim = 128 // => K = 2; Thus max(len(sharedSums)) = 256. float[] sharedSums = thread.AllocateShared<float>("sums", 256); if (tId < K * pointDim) sharedSums[tId] = 0.0f; // Keep track of how many times the k'th centroid has been assigned // to a point. max(K) = 128 int[] sharedCounts = thread.AllocateShared<int>("counts", 128); if (tId < K) sharedCounts[tId] = 0; // Make sure all threads share the same shared state before doing // any calculations. thread.SyncThreads(); // Find the optCentroid for point. double optDist = double.PositiveInfinity; int optCentroid = -1; for (int centroid = 0; centroid < K; ++centroid) { double dist = 0.0; for (int dim = 0; dim < pointDim; ++dim) { double d = centroids[centroid * pointDim + dim] - points[point * pointDim + dim]; dist += d * d; } if (dist < optDist) { optDist = dist; optCentroid = centroid; } } // Add the point to the optCentroid sum for (int dim = 0; dim < pointDim; ++dim) // CUDA doesn't support double precision atomicAdd so cast down // to float... thread.atomicAdd(ref(sharedSums[optCentroid * pointDim + dim]), (float)points[point * pointDim + dim]); // Increment the optCentroid count thread.atomicAdd(ref(sharedCounts[optCentroid]), +1); // Wait for all of the threads to complete populating the shared // memory before storing the results back to global memory where // the host can access the results. thread.SyncThreads(); // Have to do a lock on both of these since some other Streaming // Multiprocessor could be running and attempting to update the // values at the same time. // Copy the shared sums to the output sums if (tId == 0) for (int i = 0; i < K * pointDim; ++i) thread.atomicAdd(ref(outputSums[i]), sharedSums[i]); // Copy the shared counts to the output counts if (tId == 0) for (int i = 0; i < K; i++) thread.atomicAdd(ref(outputCounts[i]), sharedCounts[i]); }
Before going on to the Fit
method, let’s look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDA-enabled GPGPU. Within the CUDAfy.Translator
namespace there are a handful of classes for decompiling the application into an abstract syntax tree using ICharpCode.Decompiler
and Mono.Cecil
, then converting the AST over to CUDA C via visitor pattern, next compiling the resulting CUDA C using NVIDIA’s NVCC compiler, and finally the compilation result is relayed back to the caller if there’s a problem; otherwise, a CudafyModule
instance is returned, and the compiled CUDA C code it represents loaded up on the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy
, CudaLanguage.RunTransformsAndGenerateCode
, CUDAAstBuilder.GenerateCode
, CUDAOutputVisitor
and CudafyModule.Compile
.)
private CudafyModule cudafyModule; private GPGPU gpgpu; private GPGPUProperties properties; public int PointDim { get; private set; } public double[] Centroids { get; private set; } public CUDAfyKMeans() { cudafyModule = CudafyTranslator.Cudafy(); gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId); properties = gpgpu.GetDeviceProperties(true); gpgpu.LoadModule(cudafyModule); }
The Fit
method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory- essentially copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points arrays since its the largest resource, but did not see noticeable gains so I left it as is.
At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set
method, then the CUDA code is invoked using the Launch
routine with the specified block and grid dimensions and device pointers. One thing that the API does is return an array when you allocate or copy memory over to the device. Personally, an IntPtr
seems more appropriate. Execution of the routine is very quick, where on some of my tests it took 1 to 4 ms to process 100,000 two dimensional points. Once the routine returns, memory from the device (sum and counts) is copied back over to the host which then does a quick operation to derive the new centroid locations and copy that memory over to the device for the next iteration.
public void Fit(double[] points, int pointDim, int K) { if (K <= 0) throw new ArgumentOutOfRangeException("K", "Must be greater than zero."); if (pointDim <= 0) throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero."); if (points.Length < pointDim) throw new ArgumentOutOfRangeException("points", "Must have atleast pointDim entries."); if (points.Length % pointDim != 0) throw new ArgumentException("points.Length must be n * pointDim > 0."); int numPoints = points.Length / pointDim; // Figure out the partitioning of the data. int threadsPerBlock = properties.MaxThreadsPerBlock / 2; int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0); dim3 blockSize = new dim3(threadsPerBlock, 1, 1); dim3 gridSize = new dim3( Math.Min(properties.MaxGridSize.x, numBlocks), Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)), 1 ); int[] constValues = new int[] { pointDim, K, numPoints }; float[] assignmentSums = new float[pointDim * K]; int[] assignmentCount = new int[K]; // Initial centroid locations picked at random Random prng = new Random(); double[] centroids = new double[K * pointDim]; for (int centroid = 0; centroid < K; centroid++) { int point = prng.Next(points.Length / pointDim); for (int dim = 0; dim < pointDim; dim++) centroids[centroid * pointDim + dim] = points[point * pointDim + dim]; } // These arrays are only read from on the GPU- they are never written // on the GPU. int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues); double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids); double[] devicePoints = gpgpu.CopyToDevice<double>(points); // These arrays are written written to on the GPU. float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums); int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount); // Set up main loop so that no more than maxIter iterations take // place, and that a realative change less than 1% in centroid // positions will terminate the loop. int maxIter = 1000; double change = 0.0, pChange = 0.0; do { pChange = change; // Clear out the assignments, and assignment counts on the GPU. gpgpu.Set(deviceSums); gpgpu.Set(deviceCount); // Lauch the GPU portion gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount); // Copy the results memory from the GPU over to the CPU. gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums); gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount); // Compute the new centroid locations. double[] newCentroids = new double[centroids.Length]; for (int centroid = 0; centroid < K; ++centroid) for (int dim = 0; dim < pointDim; ++dim) newCentroids[centroid * pointDim + dim] = assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid]; // Calculate how much the centroids have changed to decide // whether or not to terminate the loop. change = 0.0; for (int centroid = 0; centroid < K; ++centroid) for (int dim = 0; dim < pointDim; ++dim) { double d = newCentroids[centroid * pointDim + dim] - centroids[centroid * pointDim + dim]; change += d * d; } // Update centroid locations on CPU & GPU Array.Copy(newCentroids, centroids, newCentroids.Length); deviceCentroids = gpgpu.CopyToDevice<double>(centroids); } while (change > 0.01 * pChange && --maxIter > 0); gpgpu.FreeAll(); this.Centroids = centroids; this.PointDim = pointDim; } } }
Python
I include the Python implementation for the sake of demonstrating how scikit-learn was invoked throughout the following experiments section.
model = KMeans( n_clusters = numClusters, init='random', n_init = 1, max_iter = 1000, tol = 1e-3, precompute_distances = False, verbose = 0, copy_x = False, n_jobs = numThreads ); model.fit(X); // X = (numPoints, pointDim) numpy array.
Experimental Setup
All experiments where conducted on a laptop with an Intel Core i7-2630QM Processor and NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. C++ and C# implementations were developed and compiled by Microsoft Visual Studio Express 2013 for Desktop targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). Python implementation was developed and compiled using Eclipse Luna 4.4.1 targeting Python 2.7, scikit-learn 0.16.0, and numpy 1.9.1. All compilers use default arguments and no extra optimization flags.
For each test, each reported test point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is computed as the (wall) time taken to execute model.fit(points, pointDim, numClusters)
where time is measured by: QueryPerformanceCounter
in C++, System.Diagnostics.Stopwatch
in C#, and time.clock
in Python. Every test is based on a dataset having two natural clusters at .25 or -.25 in each dimension.
Results
Varying point quantity
![]() |
Both the C++ and C# sequential and parallel implementations outperform the Python scikit-learn implementations. However, the C++ sequential and parallel implementations outperforms their C# counterparts. Though the C++ sequential and parallel implementations are tied, as it seems the overhead associated with multithreading overrides any multithreaded performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but does outperform the C# sequential one as the number of points to cluster increases.
So what’s the deal with Python scikit-learn? Why is the parallel version so slow? Well, it turns out I misunderstood the nJobs parameter. I interpreted this to mean that process of clustering a single set of points would be done in parallel; however, it actually means that the number of simultaneous runs of the whole process will occur in parallel. I was tipped off to this when I noticed multiple python.exe fork processes being spun off which surprised me that someone would implement a parallel routine that way leading to a more thorough reading the scikit-learn documentation. There is parallelism going on with scikit-learn, just not the desired type. Taking that into account the linear one performs reasonably well for being a dynamically typed interpreted language.
Varying point dimension
![]() |
The C++ and C# parallel implementations exhibit consistent improved run time over their sequential counterparts. In all cases the performance is better than scikit-learn’s. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why do we not better CUDAfy.NET performance? The performance we see is identical to the vary point quantity test. So on one hand it’s nice that increasing the point dimensions did not dramatically increase the run time, but ideally, the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device which is a relatively slow process. Since I’m short on time, this will have to be something I investigate in more detail in the future.
Varying cluster quantity
![]() |
As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikit-learn implementation starts to show some competitive performance. The exciting news of course is that varying the cluster size finally reveals improved C# CUDAfy.NET run time. Now there is some curious behavior at the beginning of each plot. We get performance for two clusters, then jump up into about
for four to eight clusters. Number of points and their dimension are held constant, but we allocate a few extra double’s for the cluster centroids. I believe this has to do with cache behavior. I’m assuming for fewer than four clusters everything that’s needed sits nicely in the fast L1 cache, and moving up to four and more clusters requires more exchanging of data between L1, L2, L3, and (slower) memory memory to the different cores of the Intel Core i7-2630QM processor I’m using. As before, I’ll need to do some more tests to verify that this is what is truly happening.
Language comparison
For the three tests considered, the C++ implementations gave the best run time performance on point quantity and point dimension tests while the C# CUDAfy.NET implementation gave the best performance on the cluster quantity test.
The C++ implementation could be made to run faster be preallocating memory in the same fashion that C# does. In C# when an application is first created a block of memory is allocated for the managed heap. As a result, allocation of reference types in C# is done by incrementing a pointer instead of doing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before executing the C# routines, while the same allocation takes place during the C++ routines. Hence, the C++ run times will have an overhead not present in the C# run times. Had I implemented memory allocation in C++ the same as it’s done in C#, then the C++ implementation would be undoubtedly even faster than the C# ones.
While using scikit-learn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance; frequently coming ten times slower than the other two implementations on the varying point quantity and dimension tests, but within tolerance on the vary cluster quantity tests.
Future Work
The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.
I’d like to spend more time figuring out some of the high-performance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than a week or two I spent on this. In addition, I’d like to dig a little deeper into doing CUDA C directly rather than through the convenient CUDAfy.NET wrapper; as well as explore OpenMP and OpenCL to see how they compare from a development and performance-oriented view to CUDA.
Python and scikit-learn were used a baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter since Julia pitches itself as a high-performance solution, and is used for exploratory data analysis and prototyping machine learning systems.
While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I’d like to apply CUDAfy.NET to the expectation maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of k-means clustering, and it will be good to implement the more involved EM algorithm.
Conclusions
Through this exercise, we can expect to see modest speedups over sequential implementations of about 2.62x and 11.69x in the C# parallel and GPGPU implementations respectively when attempting to find large numbers of clusters on low dimensional data. Fortunately the way you use k-means clustering is to find the cluster quantity that maximizes the Bayesian information criterion or Akaike information criterion which means running the vary centroid quantity test on real data. On the other hand, most machine learning data is of a high dimension so further testing (on a real data set) would be needed to verify it’s effectiveness in a production environment. Nonetheless, we’ve seen how parallel and GPGPU based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.
Bibliography
[LiFa89] Li Xiaobo and Fang Zhixi, “Parallel clustering algorithms”, Parallel Computing, 1989, 11(3): pp.275-290.
[MaMi09] Mario Zechner, Michael Granitzer. “Accelerating K-Means on the Graphics Processor via CUDA.” First International Conference on Intensive Applications and Services, INTENSIVE’09. pp. 7-15, 2009.
[Stu82] Stuart P. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28:129-137, 1982.
Notes from SIGGRAPH 2015
Introduction
I recently flew out to Los Angeles to attend the 42nd International Conference and Exhibition on Computer Graphics and Interactive Techniques. SIGGRAPH‘s theme this year was the crossroads of discovery bringing it closer to its roots that began here in Boulder, Colorado back in 1974. For me it was a chance to dig a little deeper into Computer Graphics research following my recent studies and develop a better understanding of the industries pushing the domain forward. As with most posts on this site, this is a short reminder to myself, and hopefully gives others an idea of what they could expect if they went.
Production Sessions
Disney – Pixar’s “Lava”: Moving Mountains was an informative production session detailing the process of bringing “Lava” to the screen. “Lava” is the story of Uku, a lonely volcano in search of love. As millions of years go by, he begins to lose hope as he recedes back into the ocean. But all is not lost. Uku finds renewed hope for love as newly formed volcano Lele rises to the surface. After the Pixar magicians reveal their secrets, technical details, and engrossing backstory, “Lava” becomes an even more enjoyable short film.
The presentation began with director James Murphy explaining his personal story inspiring the short before giving a live performance of the titular song. Colin Levy followed Murphy’s conceptualization, story boarding, and clay mockups with how the film would be framed for maximal emotional impact. Levy explain the exploratory process of filming the opening scene of the film to find the right combination of lenses, and flight paths based on real-world references to help illustrate the size and scale of Uku, the hopeless volcano.
Both Aaron Hartline and Austin Lee continued discussing the challenges of animating and rigging Uku, Lele, a pair of dolphins, birds, whales, and turtles (the last four representing young love, newly weds, established lives, and life long love). In particular, the different approaches for animating and rigging the facial features of Uku (eyelids, lips, checks, and so on) and how the teams iterated to find a balance between what the audience might expect from an anthropomorphic mountain and what they wanted to achieve as story tellers.
Perhaps the most interesting moment in the presentation was Dirk Van Gelder’s sneak peak of the enhancements the team made to Presto (Pixar’s in-house animation tool) to provide animators final render quality real-time feedback of their changes through a clever combination of Renderman-based final renders and OpenGL hardware texturing. Aside from the technical novelty, it’s a great example of time saving enhancements that make it easier for people to freely experiment and explore different approaches leading to better results.
The closing discussion by Byron Bashforth and Farhez Rayani on shading and lighting was informative and it was interesting to see how the procedural approaches were done to give Uku both a physically realistic and visually appealing biome consisting of different shaders, and static and procedural assets. Overall, a very interesting peak into the workflow of one of the most venerable studios in the industry.
Birds of a Feather
Having worked in the healthcare space for a fair bit of time, I was attracted to meetings on Volume Rendering and Medical Visualization and HealthTech: Modeling, Interaction, Hardware, and Analysis to see what people have been working on and to get a glimpse of where things are heading.
![]() |
Nicholas Polys of Virginia Tech and Michael Aratow (MD) (both chairs of the Web3D Consortium Medical Working Group) began the medical visualization discussion by going over common libraries such as VTK (The Visualization Toolkit) and Voreen (Volume Rendering Engine), before discussing general purpose analysis and visualization tools such as Paraview. Volume oriented applications such as Seg3D (volume segmentation tool), OsiriX (DICOM viewer) were covered and finally, tools for exploring biomolecular systems such as Chimera, VMD (Visual Molecular Dynamics) and PathSim (Epstein-Barr Virus exploration) were discussed giving the audience a good lay of the land. Brief bit of time was given to surgical training tools based on 3D technologies and haptic feedback (e.g. H3D).
These were all interesting applications and seeing how they all work using different types of human-machine interfaces (standard workstations, within CAVE environments, or even in virtual reality headsets and gloves) was eye opening. The second main theme of the discussion was on standardization when it comes to interoperability and reproducibility. There was a heavy push for X3D along with interoperability with DICOM. Like a lot of massive standards, DICOM has some wiggle room in it that leads to inconsistent implementations from vendors. That makes portability of data between disparate systems complicated (not to mention DICOM incorporates non-graphical metadata such as complex HL7). Suffice to say X3D is biting off a big chunk of work, and I think it will take some time for them to make progress in healthcare since it’s a fragmented industry that is not in the least bit technologically progressive.
One area I felt was absent during the discussion was how 3D graphics could be used to benefit everyday patients. There is a wealth of fMRI and ECoG data that patients could benefit from seeing in an accessible way- for example showing a patient a healthy baseline, then accentuating parts of their own data and explaining how those anomalies affect their well-being. If a component can be developed to deliver that functionality, then it can be incorporated into a patient portal alongside all other charts and information that providers have accumulated for the patient.
The HealthTech discussion was presented by Ramesh Raskar, and his graduate students and postdocs from the MIT Media Lab. They presented a number of low-cost, low-power diagnostic devices for retinal imaging and electroretinography, high-speed tomography, cellphone-based microscopy, skin perfusion photography, and dental imaging. Along with more social oriented technologies for identify safe streets to travel, and automatically discerning mental health from portraits. There were plenty of interesting applications being developed by the group, but it was more of a show and tell by the group than discussing the types of challenges beyond the scope of the work by MIT Media Lab (as impressive as they are). (For example, The fine work 3Shape A/S has done with fast scanning of teeth for digital dentistry.)
One thing that was discussed of key interest was Meddit a way for medical practitioners and researchers to define open problems to maturity, then presenting those challenges to computer scientists to work on and develop solutions. While the company name is uninspired, I think this is the right kind of collaboration platform for the “toolmaker” view of hardware engineers, computer scientists and software engineers as it identifies a real issue, presents an opportunity, and gives a pool of talented, bright people a way to make a difference. I am skeptical that it will take off (I think it would have more success as a niche community within an umbrella collaboration platform- i.e. Stack Exchange model), but the idea is sound and something people should get excited about.
Real-Time Live!
The challenge of real-time graphics is very appealing to me and getting to see what different software studios are working on was a real treat. While there were several presentations and awards given during the two hour long event, three demos stood out to me. Balloon Burst given by Miles Macklin of NVIDIA, BabyX presented by Mark Sagar of University of Auckland, and award winner A Boy and His Kite demoed by Nick Penwarden of Epic Games.
![]() |
Macklin’s demo was impressive in that it simulated more than 750,000 particles (250,000 by their solver Flex, and 512,000 for mist and droplets) and their paper [pdf] Fast Grid-Free Surface Tracking gave some technical background into how they achieved their results. Fluid simulation is something I’d like to spend some time exploring, obviously won’t be able to create something as technical as Macklin’s group, but would like to spend some time on Smoothed-Particle Hydrodynamics, and seeing NVIDIA’s work was a good motivation boost to explore the subject further on my own.
Perhaps the most unexpected entry in the series was Sagar’s BabyX. It was a fascinating assemblage of neural networks, real time graphics, natural language processing, computer vision, and image processing to create the ultimate “Sims” like character- a baby that could learn and invoke different emotional responses based on external stimuli. Real-time graphics were photorealistic, and seeing the modeling behind the system to emulate how the brain behaves in the presence of different dopamine levels (and how those levels correspond to things like Parkinson’s and schizophrenia) was impressive as well. Overall, a fantastic technical achievement and I look forward to following Sagar’s work as it continues to evolve.
My main interest in going to Real-Time Live! was to see Penwarden’s work on A Boy and His Kite. This impressive demo spanning hundred square miles inspired by the Isle of Skye really puts to shame my prior work in creating procedural environments. Nonetheless, it goes to show to far the medium can be pushed and how small the divide between real-time and film is becoming. Computer Graphics World published (July-August 2015) a very thorough technical overview [p. 40-48] of how Penwarden’s team produced the short, in addition to the features added to Unreal Engine 4 to make the demo shine.
Wrap-up
There were many other things I explored that I won’t go into detail- namely the VR Village, Emerging Technologies, Research Posters, Exhibition, and Job Fair. I’m still quite skeptical that virtual reality (and to the same extent augmented reality) technologies will come into the mainstream; I think they’ll continue to be the subject of researchers, gaming enthusiasts, and industry solutions for automotive, and healthcare problems. One thing that was a bit of a disappointment was the Job Fair as there were barely any companies participating. Overall, a positive experience learning what other people are doing in the industry, and getting to see how research is being applied in a variety of different domains including automotive, entertainment, engineering, healthcare, and science.
Deep Learning for Automatic Speech Recognition
Introduction
The problem of automatic speech recognition, and details of the traditional Hidden Markov Model and Gaussian Mixture Model hybrid architecture (HMM-GMM) for acoustic modeling are detailed in [JM08], but will be skipped here. Instead, the focus of this literature review is to discuss how [DYDA12] uses a context dependent Hidden Markov Model and Deep Neural Network hybrid architecture (CD-HMM-GMM) for acoustic modeling as it represents a significant improvement over the traditional HMM-GMM approach. This review will begin with motivation for the architecture, then go into detail the algorithms used for pre-training, and outline the algorithms used for training before concluding with how well the approach outperforms the standard HMM-GMM approach.
Architecture
To motivate their architecture, [DYDA12] rely on the standard noisy channel model for speech recognition presented in [JM08] where we wish to maximize the likelihood of a decoded word sequence given our input audio observations:
(1) |
Where and
represent the language and acoustic models respectively. [JM08] state that the language model can be computed via an N-gram model; [DYDA12] acknowledge using this approach, but focus their efforts into explaining their acoustic model:
(2) |
Here the acoustic model is viewed as a sequence of transitions between states of tied-state triphones which [DYDA12] refer to as senones giving us the context dependent aspect of the architecture. [FLMS14] explains that senones represent the pronunciation of words and are derived by decision trees. By tying triphone states together, this approach is able to avoid having to process a large number of triphones and avoid the likely sparseness of training examples for every possible triphone.
The model assumes that there is a probability for the starting state, probabilities
of transitioning to the state observed at step
to step
, and finally, the probability of the acoustics given the current state
. [DYDA12] expand this last term further into:
(3) |
Where models the tied triphone senone posterior given mel-frequency cepstral coefficients (MFCCs) based on 11 sampled frames of audio. While MFCCs come from signal processing, they have proven to be effective features for automatic speech recognition. Based on the power spectrum derived from sample audio frames, MFCCs represent characteristics of the audio that our ears are sensitive to as explained in [Ada10].
is the prior probability of the senone, and
can be ignored since it does not vary based on the decoded word sequence we are trying to find.
Based on this formalism, [DYDA12] chose to use a pre-trained Deep Neural Network to estimate using MFCCs as DNN inputs and taking the senone posterior probabilities as DNN outputs. The transitioning between events is best modeled by a Hidden Markov Model whose notation,
appears in Eq. (2). Now that we have an overview of the general CD-DNN-HMM architecture, we can look at how [DYDA12] train their model.
Pre-Training
Given the DNN model we wish to fit the parameters of the model to a training set. This is usually accomplished by minimizing a likelihood function and deploying a gradient descent procedure to update the weights. One complication to this approach is that the likelihood can be computationally expensive for multilayer networks with many nodes rendering the approach unusable. As an alternative, one can attempt to optimize a computationally tractable surrogate to the likelihood. In this case the surrogate is the contrastive divergence method developed by [Hin02]. This sidestep enabled [HOT06] to develop an efficient unsupervised greedy pre-training process whose results can then be refined using a few iterations of the traditional supervised backpropagation approach. In this portion of the paper we discuss the work of [Hin02] and explain the greedy algorithm of [HOT06] before going on to discuss the high-level training procedure of [DYDA12].
To understand the pre-training process, it is necessary to discuss the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) models. RBMs are an undirected bipartite graphical model with Gaussian distributed input nodes in a visible layer connecting to binary nodes in a hidden layer. Every possible arrangement of hidden, , and visible,
, nodes is given an energy under the RBM model:
(4) |
Where is the weight of connections between nodes and vectors
and
correspond to the visible and hidden biases respectively. The resulting probability is then given by:
(5) |
Where is a normalization factor. Based on the assumptions of the RBM, [DYDA12] derive expressions for
and
given by:
(6) |
Where is an element-wise logistic function. [DYDA12] argue that Eq. (6) allows one to repurpose the RBM parameters to initialize a neural network. Training of the RBM is done by stochastic gradient descent against the negative log likelihood since we wish to find a stable energy configuration for the model:
(7) |
however [DYDA12] point out that the gradient of the negative log likelihood cannot be computed exactly since the term takes exponential time. As a result, the contrastive divergence method is used to approximate the derivative:
(8) |
where is a single step Gibbs sampled expectation. These terms are expectations in which nodes
are simultaneously active given the training data and model. Given this insight, regular stochastic gradient descent can be performed and the parameters of a RBM fitted to training data.
Now that we have an understanding of RBMs, we can shift our focus to DBNs. A Deep Belief Network is a multilayer model with undirected connections between the top two layers and directed between other layers. To train these models, [HOT06] had the insight to treat adjacent layers of nodes as RBMs. One starts with the bottom two layers and trains them as though they were a single RBM. Once those two layers are trained, then the top layer of the RBM is treated as the input layer of a new RBM with the layer above that layer acting as the hidden layer of the new RBM. The sliding window over the layers continues until the full DBN is trained. After this, [HOT06] describe an “up-down” algorithm to further refine the learned weights. The learned parameters of this greedy approach can then be used as the parameters of a DNN as explained earlier in the discussion of Eq. (6).
Training
Training of the CD-DNN-HMM model consists of roughly a dozen involved steps. We won’t elaborate here on the full details of each step, but will instead provide a high-level sketch of the procedure to convey its general mechanics.
The first high-level step of the procedure is to initialize the CD-DNN-HMM model. This is done by first training a decision tree to find the best tying of triphone states which are then used to train a CD-GMM-HMM system. Next, the unique tied state triphones are each assigned a unique senone identifier. This mapping will then be used to label each of the tied state triphones. (These identifiers will be used later to refine the DNN.) Finally, the trained CD-GMM-HMM is converted into a CD-DNN-HMM by retaining the triphone and senone structure and HMM parameters. This resulting DNN goes through the previously discussed pre-training procedure.
The next high-level step iteratively refines the CD-DNN-HMM. To do this, first the originally trained CD-GMM-HMM model is used to generate a raw alignment of states which is then mapped to its corresponding senone identifier. This resulting alignment is then used to refine the DBN by backpropagation. Next, the prior senone probability is estimated based on the number of frames paired with the senone and the total number of frames. These estimates are then used to refine the HMM transition probabilities to maximize the features. Finally, if this newly estimated parameters do not improve accuracy against a development set, then the training procedure terminates; otherwise, the procedure repeats this high-level step.
Experimental Results
System Configurations
[DYDA12] report that their system relies on nationwide language model consisting of 1.5 million trigrams. For their acoustic model, they use a five hidden layer DNN with each layer containing 2,048 hidden units. Training the system from scratch on 24 hours of training data takes four days on a Dell T3500 workstation with an NVIDIA Tesla GPU. [DYDA12] emphasize the importance of the GPU in obtaining acceptable training time, and that without it, training time would be 30x slower.
Datasets and Metrics
Comparison of automatic speech recognition system consists of three principle error metrics: sentence (SER), word (WER), and phoneme (PER) error rates. These look at the ratio of incorrect entities to the number of total entities with the exception of word error rate which uses a Levenshtein approach to measure the number of insertions, substitutions, and deletions relative to the total number of words. A sentence is considered incorrect if there is at least one incorrect word.
These error metrics often coincide with different datasets, in particular WER is reported for Switchboard, SER for Bing Mobile Voice Search (BMVS), and PER on TIMIT. Switchboard is a collection of phone conversations between two people, while BMVS is a collection of short spoken questions such as “The Med” or “Chautauqua Park” that are used to find these locations, while TIMIT is a phonetic focused corpus of spoken sentences that are phonetically rich.
Results
Switchboard | BMVS | TIMIT | |
(WER) | (SER) | (PER) | |
GMM | 23.6[2] | 36.2[1] | 21.7[2] |
DNN | 16.1[2] | 30.4[1] | 21.9[3] |
CNN | – | – | 20.2[3] |
RNN | – | – | 17.7[4] |
Direct comparison of models is complicated by the variety of error metrics and datasets; [DBL12] is used to fill in these gaps to make a meaningful comparison. As one can see from Table (1), the neural network approaches do better on average over the traditional GMM approach. To illustrate that it is not only DNN approaches that do better, the work of [AMJ+14] using a Convolutional Neural Network (CNN) and [GMH13] using a Recurrent Neural Network (RNN) are included to further drive the point that neural network architectures are viable alternatives to GMMs.
Conclusions
[DYDA12], [AMJ+14], and [GMH13] have shown that neural network architectures exhibit better performance over Gaussian Mixture Models. [DYDA12] believes that a more capable first layer model provided by mean-covariance restricted Boltzmann machines will increase performance, while [AMJ+14] plans to investigate unexpected improvements in large-vocabulary speech recognition where they were absent in phone recognition tasks when using convolutional restricted Boltzmann machines. Both routes seem promising and are likely to produce improved error rates inline with [GMH13]’s results.
In [DBL12], the authors of both research groups suggest key gains will come from improved understanding of the pre-training process and how the types of units used in these models affect error rates. They conclude that distributed training is the largest hurdle to overcome for these systems to make use of more training data. (Parallelization is limited by the sequential stochastic gradient descent at the heart of the pre-training and training processes.) As [DYDA12] point out in their paper, GPU-based approaches can assist in reducing computation time, but more foundational approaches need to be pursued.
In a 2014 talk [Hin14], Hinton criticizes existing neural network architectures on philosophical grounds arguing that they do not correspond well enough to how the brain functions citing inadequate structural complexity. His proposed solution is a new neural network approach that clusters neurons together into capsules, which he believes will better model how the cortical columns of the brain behave. If Hinton is right (which his track record suggests), then it is likely we’ll see this capsule approach outperform existing models, and consequently, yield improved error rates in automatic speech recognition.
References
[Ada10] Andre Gustavo Adami. Automatic speech recognition: From the beginning to the portuguese language. In The Int. Conf. on Computational Processing of Portuguese (PROPOR). Rio Grande do Sul: Porto Alegre, 2010.
[AMJ+14] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Ui. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533-1545, 2014.
[DBL12] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research grounds. IEEE Signal Process. Mag., 29(6):82-97, 2012.
[DYDA12] George E. Dahl, Dong Ui, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, speech & Language Processing, 20(1):30-42, 2012.
[FLMS14] Luciana Ferrer, Yun Lei, Mitchell McLaren, and Nicolas Scheffer. Spoken language recognition based on senone posteriors. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communnication Association, Singapore, September 14-18, 2014, pages 2150-2154. ISCA, 2014.
[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton, Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645-6649, 2013.
[Hin02] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[Hin14] Geoffrey E. Hinton. What’s wrong with convolutional nets? Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Fall Colloquium Series, 2014.
[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.
Parallel Merge Sort in Java
Introduction
This past November I was a pretty busy getting settled into a new job and trying to balance life’s other priorities. With a new job also came a new technology stack and while I’ll continue to do C# development in my free time, I’m going to be going back to doing Java development after a seven year hiatus. Before starting the new job, I decided to refresh my memory of the language’s finer details when it comes to generics and threading. So, I decided to implement something simple and settled on a parallel implementation of merge sort. This article is going to focus on making use of Java’s various features and evaluating the theoretical and empirical run time performance of the sequential and parallel versions of the algorithm.
Sequential Approach
Specification
Given a list of values, the list is sorted by employing a divide and conquer method that partitions the list into two (roughly) equal sized partitions, followed by recursively sorting each partition and then merging the two resulting sorted partitions into the final sorted list.
Pseudocode
|
|
Time Complexity
In terms of time complexity, the algorithm is on the order of . To show this, observe that the input size,
, is divided into to two equal parts,
, followed by a merge operation,
. This leads to the recurrence relation given by
. By induction, the recurrence relation is reduced to
. Observing that the merge function is on the order
, i.e.,
, then the expression reduces further to
and
. As the number of subdivisions increases, eventually
will be reduced to
. As such, let
which implies
which implies
, and thus
. Therefore,
Implementation
In attempting to implement a generic version of merge sort there were a few matters that needed to be addressed. First, the type being sorted required an order relation to be specified so that the merge operation could take place. This is facilitated by restricting the type parameter T
to Comparable<T>
. Secondly, I had forgotten that you can’t initialize arrays of generics in Java like you can in C# [1]. To workaround this limitation, I settled on specifying the desired operations over implementations of the List<T>
interface. Finally, since the List<T>
interface makes no guarantees that its implementations provide (near) constant time reading or writing of elements from the list, an additional generic parameter, L
, was added so that only those implementations of the List<T>
and RandomAccess
[2] interfaces could use this implementation of merge sort. The rest of the implementation is a near facsimile of the pseudocode.
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public interface IListOperation <T, L extends List<T> & RandomAccess> { L execute(); }
package com.wordpress.antimatroid; import java.util.ArrayList; import java.util.List; import java.util.RandomAccess; public class CopyListOperation <T, L extends List<T> & RandomAccess> implements IListOperation<T, L> { private final L source; private final int length, initialIndex; public CopyListOperation(L source, int length, int initialIndex) { if(source == null) throw new IllegalArgumentException("source must be non-null."); if(length < 0) throw new IllegalArgumentException(String.format( "length, %d, must be greater than or equal to zero.", length )); if(initialIndex < 0) throw new IllegalArgumentException(String.format( "initialIndex, %d, must be greater than or equal to zero.", initialIndex )); if(initialIndex + length > source.size()) throw new IllegalArgumentException(String.format( "initialIndex, %d, + length, %d, must be less than or equal to source.size(), %d.", initialIndex, length, source.size() )); this.source = source; this.length = length; this.initialIndex = initialIndex; } @Override public L execute() { L destination = (L) new ArrayList<T>(length); for(int i = 0; i < length; i++) destination.add(i, source.get(initialIndex + i)); return destination; } }
package com.wordpress.antimatroid; import java.util.ArrayList; import java.util.List; import java.util.RandomAccess; public class MergeListOperation <T extends Comparable<T>, L extends List<T> & RandomAccess> implements IListOperation<T, L> { private final L a, b; public MergeListOperation(L a, L b) { if(a == null) throw new IllegalArgumentException("a must not be null."); if(b == null) throw new IllegalArgumentException("b must not be null."); this.a = a; this.b = b; } @Override public L execute() { int length = a.size() + b.size(); L c = (L) new ArrayList<T>(length); int i = 0, j = 0; for(int k = 0; k < length; k++) { if(i < a.size() && j < b.size()) { if(a.get(i).compareTo(b.get(j)) <= 0) { c.add(k, a.get(i++)); } else { c.add(k, b.get(j++)); } } else if (i < a.size() && j >= b.size()) { c.add(k, a.get(i++)); } else if (i >= a.size() && j < b.size()) { c.add(k, b.get(j++)); } else { break; } } return c; } }
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public class MergeSortListOperation < T extends Comparable<T>, L extends List<T> & RandomAccess > implements IListOperation<T, L> { private final L a; public MergeSortListOperation(L a) { if(a == null) throw new IllegalArgumentException("a must not be null."); this.a = a; } @Override public L execute() { if(a.size() <= 1) return a; CopyListOperation<T, L> leftPartition = new CopyListOperation<T, L>(a, (a.size() / 2) + a.size() % 2, 0); CopyListOperation<T, L> rightPartition = new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) + a.size() % 2); MergeSortListOperation<T, L> leftSort = new MergeSortListOperation<T, L>(leftPartition.execute()); MergeSortListOperation<T, L> rightSort = new MergeSortListOperation<T, L>(rightPartition.execute()); MergeListOperation<T, L> merge = new MergeListOperation<T, L>(leftSort.execute(), rightSort.execute()); return merge.execute(); } }
Run Time Analysis
Noting that the theoretical time complexity is , inputs of the form
will yield a
curve. Taking the logarithm of which will give
. Observing that as
increases the linear term will dominate the expression. As a result, the curve should look near linear in logarithmic space with the exception of small values of
. Which means that conducting a linear least squares regression of the empirical run times in logarithmic space will yield a satisfactory approximation to the theoretical time complexity.
To verify that the implementation follows the theoretical time complexity, increasing values of were used to generate lists containing
random values. These lists were then sorted and the
System.nanoTime()
before and after values were used to determine the elapsed time. These values were collected and a total of 50 identical trails were conducted on an Intel Core i7-2630QM CPU @ 2.00 GHz based machine with 6.00 GB RAM.
As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.
Parallel Approach
Specification
The parallel implementation operates under the premise that the divide portion of merge sort can be easily parallelized by sorting one partition on the present thread and sorting the other partition on a secondary thread. Once the secondary thread has completed, then the two threads join, and consequently, the two sorted lists are merged. To avoid copious thread creation, whenever the input size is less than a threshold, , the sequential version of the algorithm is used.
This process can be easily visualized below where each left-hand branch is the originating thread processing the first partition, each right-hand branch is the secondary thread processing the second partition and the junction of those edges represents the consequent merge operation after the secondary thread as joined back in with the originating thread.
Time Complexity
The introduction of parallelism changes the original recurrence relation to the following:
Assuming, , and that there is no asymptotic difference in sorting the first and second partition, then the time complexity is on the order of
. To see this, observe that the recurrence relation becomes
under the presented assumtions. Following the same process of induction as in the sequential case, the recurrence relation reduces to
and is simplified further under the assumption
to
. Observing that the sum is a finite geometric series leads to
and under the same reduction argument as before to
. Thus, the time complexity of the parallel merge sort specified is
Assuming , then the time complexity of the algorithm is still on the order
. Thus, for various values of
and
, the time complexity is between
.
Implementation
In terms of parallelizing the sequential implementation, an addition interface, IThreadedListOperation
was added to provide a BeginOperation
, EndOperation
asynchronous programming model found in the .net world. After looking around the Java world, I didn’t encounter a preferred idiom, so I went with what I knew.
As I mentioned in the sequential approach, the original data structures were going to be arrays which have a guarantee of providing thread safe reads, but not necessarily thread safe writes. To avoid the issue all together, I decided that the IListOperations
should always return a new List<T>
instance so that only one thread at a time would be reading or manipulating that memory. Since I knew my implementation would not be sharing IListOperations
between threads, I decided not to gold plate the implementation with synchronization constructs. If in the future such ability were required, I would go back and modify the code accordingly.
For the parallel implementation I took advantage of the fact that method arguments are evaluated left-to-right [3] to save one some space, but if the specification ever changed, then it would be more appropriate to move the out the leftSort.execute()
and rightSort.executeEnd()
methods up a line to form a more explicit operation.
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; abstract public class IThreadedListOperation <T, L extends List<T> & RandomAccess> implements Runnable, IListOperation<T, L> { private Thread thread; public void executeBegin() { if(thread != null) throw new IllegalStateException(); thread = new Thread(this); thread.start(); } public L executeEnd() { if(thread == null) throw new IllegalStateException(); try { thread.join(); } catch (InterruptedException e) { } return getResult(); } public L execute() { if(thread != null) throw new IllegalStateException(); run(); return getResult(); } abstract protected L getResult(); }
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public class MergeSortThreadedListOperation <T extends Comparable<T>, L extends List<T> & RandomAccess> extends IThreadedListOperation<T, L> { private final L a; private L b; private final int threshold; public MergeSortThreadedListOperation(L a) { this(a, 1024); } public MergeSortThreadedListOperation(L a, int threshold) { if(a == null) throw new IllegalArgumentException("a must be non-null."); if(threshold <= 0) throw new IllegalArgumentException("threshold must be greater than zero."); this.a = a; this.threshold = threshold; } @Override public void run() { if(a.size() <= 1) { b = a; return; } if(a.size() <= threshold) { MergeSortListOperation<T, L> mergeSort = new MergeSortListOperation<T, L>(a); b = mergeSort.execute(); return; } CopyListOperation<T, L> leftPartition = new CopyListOperation<T, L>(a, (a.size() / 2) + a.size() % 2, 0); MergeSortThreadedListOperation<T, L> leftSort = new MergeSortThreadedListOperation<T, L>(leftPartition.execute()); CopyListOperation<T, L> rightPartition = new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) + a.size() % 2); MergeSortThreadedListOperation<T, L> rightSort = new MergeSortThreadedListOperation<T, L>(rightPartition.execute()); rightSort.executeBegin(); MergeListOperation<T, L> merge = new MergeListOperation<T, L>(leftSort.execute(), rightSort.executeEnd()); b = merge.execute(); } @Override protected L getResult() { return b; } }
Run Time Analysis
Noting that the time complexity for the parallel approach is , a simple linear least squares regression of the empirical run times in normal space will yield a satisfactory approximation to the theoretical time complexity.
The trial methodology used in the sequential run time analysis is used once again to produce the following plot. Note that it begins at 2048 instead of 1. This was done so that only the parallel implementation was considered and not the sequential implementation when the input size is .
As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.
Threshold Selection
As a thought experiment, it makes sense that as the threshold approaches infinity, that there is no difference between the sequential implementation and parallel one. Likewise, as the threshold approaches one, then the number of threads being created becomes exceedingly large and as a result, places a higher cost on parallelizing the operation. Someplace in the middle ought to be an optimal threshold that yields better run time performance compared to the sequential implementation and a pure parallel implementation. So a fixed input size should produce a convex curve as a function of the threshold and hence have a global minimum.
Conducting a similar set of trials as the ones conducted under the analysis of the sequential run time give a fully parallel and sequential curve which to evaluate where the optimal threshold resides. As the plot depicts, as the threshold approaches one, there is an increase in the processing taking the form of a convex curve. As the threshold exceeds the input size, then the sequential approach dominates. By conducting a Paired T-Test against the means of the two curves at each input size, 1024 was determined to be the optimal threshold based on the hardware used to conduct the trials. As the input size grows, it is evident that for thresholds less than 1024, the sequential approach requires less time and afterwards, the parallel approach is favorable.
Conclusion
In comparing the sequential and parallel implementations it was observed that the specified parallel implementation produced as much as a 2.65 factor improvement over the specified sequential implementation for megabyte sized lists.
Larger sized lists exhibited a declining improvement factor. It is presumed that as the input size grows that the amount of memory being created is causing excessive paging and as a result increasing the total run time and consequently reducing the improvement factor. To get around this limitation, the algorithm would need to utilize an in-place approach and appropriate synchronization constructs put into place to guarantee thread safety.
From a theoretical point of view, the improvement factor is the ratio of the run time of the sequential implementation to the parallel implementation. Using the time complexities presented, . Taking the limit as the input size grows to infinity gives
. So if there is any upper bound to the improvement factor it should be purely technical.
Footnotes
[1] This design decision is discussed in §4.7 of the Java Language Specification (3rd Edition) on reifiable types.
[2] The only two java.util classes providing this guarantee are ArrayList
and Vector
. Both of which implement the interface RandomAccess
which is intended indicate that the class provides the (near) constant reading and writing of elements.
[3] The left-to-right order of operations is specified by §15.7.4 of the Java Language Specification (3rd Edition). Also worth noting the specification recommends against the practice of relying on this convention however in §15.7:
… It is recommended that code not rely crucially on this specification. Code is usually clearer when each expression contains at most one side effect, as its outermost operation, and when code does not depend on exactly which exception arises as a consequence of the left-to-right evaluation of expressions.
Category Recognition of Golden and Silver Age Comic Book Covers
Introduction
Motivation
For a while now, I’ve been wanting to work on a computer vision project and decided for my next research focused project that I would learn some image processing and machine learning algorithms in order to build a system that would classify the visual contents of images, a category recognizer. Over the course of the summer I researched several techniques and built the system I had envisioned. The end result is by no means state of the art, but satisfactory for four months of on-and-off development and research. The following post includes my notes on the techniques and algorithms that were used in the project followed by a summary of the system and its performance against a comic book data set that was produced during development.
Subject Matter
The original subject matter of this project were paintings from the 1890s done in the Cloisonnism art style. Artists of the style are exemplified by Emile Bernard, Paul Gaugin and Paul Serusier. The style is characterized by large regions of flat colors outlined by dark lines; characteristics that would be easy to work with using established image processing techniques. During development, it became evident that no one approach would work well with these images. As an alternative, I decided to work with Golden and Silver Age comic book covers from the 1940s to 1960s which also typified this art style. Many of the comic books were drawn by the same individuals such as Jack Kirby, Joe Shuster and Bob Kane. As an added benefit, there are thousands of comic book covers available online compared to the dozens of Cloisonnism paintings.
Image Processing
Representation
An image is a function, , where each input vector,
, represents an image coordinate and each output vector,
, represents the red, blue and green (RGB) channels,
, of an image. Individual input values are bound between zero and the width,
, or height,
, of the image and output values are between zero and
. Each output vector represents a unique color in RGB space giving rise to
possible colors. Below is a basic sample image broken down into to its individual channels.
![]() |
Like any other vector field, transformations can be applied to the image to yield a new image, . In image processing, these transformations are referred to as image filters and come in three varieties of point-based, neighbor-based and image-based filters. As the names suggest, point-based filters map single output vectors to a single output vector, neighbor-based filters map neighboring output vectors to a single output vector, and image-based filters map the whole image and a single or neighboring set of output vectors to a single output vector.
There are many different instances of these types of filters, but only those used in this project are discussed below. Computational complexity and efficient algorithms for each type of filter are also discussed where appropriate.
Point-based Filters
Point-based filters, , map an output vector to a new output vector in the form
. Application of a point-based filter is done in quadratic time with respect to the dimensions of the image
.
Grayscale Projection
It is helpful to collapse the RGB channels of an image down to a single channel for the purpose of simplifying filter results. This can be done by using a filter of the form . Alternatively one can use a filter of the form
to represent the luminescence of the output vector.
Thresholding
A threshold filter serves as a way to accentuate all values in the image greater than or equal to a threshold, , or to attenuate all those values less than the threshold.
The first variety is the step threshold filter, , which exhibits an ideal threshold cutoff after the threshold value.
The second variety is a logistic threshold filter, , with an additional parameter,
, allowing for wiggle room about the threshold yielding a tapered step function as
increases in size.
![]() |
Neighbor-based Filters
All neighbor-based filters take the output vectors neighboring an input vector to calculate a new output vector value. How the neighboring output vectors should be aggregated together is given by a kernel image, , and the computation is represented as a two-dimensional, discrete convolution.
Neighbor-based filters can be applied naïvely in quartic time as a function of the image and kernel dimensions, . However, a more efficient implementation allows for
time by way of the Discrete Fourier Transform.
The Discrete Fourier Transform is a way of converting a signal residing in the spatial domain into a signal in the frequency domain by aggregating waveforms of varying frequencies where each waveform is amplified by its corresponding value in the input signal. The Inverse Discrete Fourier Transform maps a frequency domain signal back to the spatial domain.
Applying the Discrete Fourier Transform to a convolution, , comes with the convenient property that the transformed convolution can be rewritten as the product of the transformed functions,
, by way of the Convolution Theorem.
The improved time complexity is achieved by using a divide a conquer algorithm known as the Fast Fourier Transform which takes advantage of the Danielson-Lanczos Lemma which states that the Discrete Fourier Transform of a signal can be calculated by splitting the signal into two equal sized signals and computing their Discrete Fourier Transform.
For the purposes of image processing, we use the two-dimensional Discrete and Inverse Discrete Fourier Transform.
The expression can be rearranged to be the Discrete Fourier Transform of each column in the image and then computing the resulting Discrete Fourier Transform of those results to obtain the full two-dimensional Discrete Fourier Transform.
As a result, we can extend the Fast Fourier Transform in one dimension easily into two dimensions producing a much more efficient time complexity.
Weighted Means: Gaussian and Inverse Distance
Weighted mean filters are used to modify the morphology of an image by averaging neighboring output vectors together according to some scheme.
A Gaussian filter is used to blur an image by using the Gaussian distribution with standard deviation, , as a kernel.
The inverse distance filter calculates how far the neighboring output vectors are with respect to the new output vector being calculated. Each result is also scaled by the parameter, , allowing for contrast adjustments.
Laplacian
A Laplacian filter detects changes in an image and can be used for sharpening and edge detection. Much like in calculus of a single variable, the slope of a surface can be calculated by the Gradient operator, . Since it is easier to work with a scalar than a vector, the magnitude of the gradient is given by the Laplacian operator,
.
Since an image is a discrete function, the Laplacian operator needs to be approximated numerically using a central difference. represents the spacing between successive samples of the underlying function. Since the finest resolution that can be achieved in an image is an individual displacement,
.
![]() |
Image-based Filters
Image-based filters calculate some information about the contents of the image and then use that information to generate the appropriate point-based and neighbor based filters.
Normalization
The normalization process computes the minimum, and maximum,
values of each channel and linearly maps all values between those extrema to new values between the possible channel extrema of
and
.
This particular image-based filter can be applied in quadratic time, , to calculate the extrema of the image and apply the linear map.
![]() |
Edge Detection
Edge detection is the process of identifying changes (e.g., texture, color, luminance and so on) in an image. As alluded to in the image processing section, the Laplacian filter is central to detecting edges within an image. As a result A sequence of filters is used before and after a Laplacian filter to produce a detector that consistently segments comic book covers. The following sequence of filters was used.
- Grayscale Projection – Since all logical components of a comic book cover are separated by inked lines, it is permissible to ignore the full set of RGB channel information and collapse the image down to a grayscale image.
- Normalization – It is conceivable that the input image has poor contrast and brightness. To ensure that the full range of luminescence values are presented, the image is normalized.
- Gaussian (
) – An image may have some degree of noise superimposed on the image. To reduce the noise, the image is blurred using a Gaussian filter with a standard deviation of
. This is enough to smooth out the image without distorting finer image detail.
- Laplacian – Once the image has been prepared, its edges are calculated using the Laplacian filter.
- Normalization – Most of the changes in the image may be subtle and need to make sure that all edge information is accentuated as much as possible by applying a normalization filter.
- Step Threshold (
) – Since a partial edge isn’t particularly useful information, any edge RGB value less than
is attenuated to zero and all other values accentuated to
.
- Inverse Distance (
) – It is possible that during the threshold process that discontinuities were introduced into some of the edges. To mitigate this impact, an inverse distance filter is used to inflate existing edges and intensify the result with a gain of
.
The complete edge detection process takes computational complexity of due to the neighbor-based filters used to eliminate noise and smooth edge discontinuities.
Segmentation
With the edge image it is possible to segment the image into its visual components. This is achieved by doing a flood fill on the image and using the edge image as the boundaries for the fill. Once the fill runs out of points to flood, the segment is complete and the next remaining point in the image is considered. To reduce the number of minuscule segments, only those segments representing of the image are included.
![]() |
Machine Learning
Classifiers
The task of classification is to identify decision boundaries separating all of the classification within the data set. Such data sets can be linearly or non-linearly separable and as a result, classifiers were developed to solve the linear case and then adapted to deal with the more complicated non-linear case. While there are a number of classifiers, only the K-Nearest Neighbor and Support Vector Machine classifiers were researched and implemented in this project.
K-Nearest Neighbor
The K-Nearest Neighbor classifier is an online classifier which operates under the assumption that a yet to be classified vector is most likely to be the same classification as those training vectors which are closest to the vector based on a distance measure,
.
Distance can be measured in a variety of ways for arbitrary vectors, , residing in some real space. The most common of which are specialized cases of the Minkowski distance.
The Manhattan distance, , yields the distance traveled along a grid between two vectors (hence a name in reference to the New York City borough). The Euclidean distance,
, gives the distance between the vectors in the usual familiar sense. The last specialized cased considered is the Chebyshev distance,
, which gives the maximum distance between any one dimension of the two vectors.
Two factors affect the efficacy of the algorithm. The first is the dimension of the data, , and the size of the train data set,
. As the training data set increases with size, there are more vectors which a test vector must be compared against. As a result, an efficient means of searching the training set must be used to yield satisfactory performance. This can be achieved by using kd-Trees which give
search performance or branch and bound methods giving similar performance. As the dimensionality of the dataset increases, the efficacy of kd-Trees diminishes to a near linear search of the training data set due to the “curse of dimensionality.”
![]() |
Support Vector Machine
Formulation
The Support Vector Machine classifier is an offline linear, binary classifier which operates under the assumption that a training set, , consists of linearly separable classifications,
, of data,
, by some optimal hyperplane of the form
. Where
is the inner product,
and
. When
, then the classification
is presented and when
, the classification
is presented.
![]() |
The hyperplane is padded by two hyperplanes separated by an equal distance to the nearest training examples of each classification. The span between the supporting hyper planes is the margin. The goal then is to pick a hyperplane which provides the largest margin between the two separable classifications. The margin between the supporting hyperplanes is given by . Given the demarcation criteria, the maximum margin will also be subject to the constraint that all training examples satisfy
. As a result of the objective function and accompanying linear constraint, the problem is stated in terms of its native primal Quadratic Programming form.
subject to
To find the optimal parameters, it is easier to translate the problem into a dual form by applying the technique of Lagrange Multipliers. The technique takes an objective function, , and constraint functions,
, and yields a new function
to be optimized subject to the added constraint
.
subject to
The next step is to differentiate the objective function with respect to the parameters to determine the optimal solution. Since the function is concave, the results will yield the desired maximum constraints.
As a result the dual problem can be written as the following:
subject to
,
Handling of non-linearly separable data
In the event that the data is not linearly separable, then an additional parameter, , is added as a penalty factor for those values that reside on the wrong side of the hyperplane. The derivation for the quadratic program is identical to the one presented above with the exception that the lagrange multipliers now have an upper boundary
.
Non-linear classification
By way of Mercer’s Theorem, the linear Support Vector Machine can be modified to allow for non-linear classification through the introduction of symmetric, positive semidefinite kernel functions, . The idea being that if the data is not linearly separable in its present dimensional space that by mapping it to a higher dimensional space that the data may become linearly separable by some higher dimensional hyperplane. The benefit of a kernel function is that the higher dimensional vector need not be computed explicitly. This “kernel trick” allows for all inner products in the dual representation to be substituted with a kernel.
subject to
,
And the decision hyperplane function then becomes:
The following are some typical kernels:
- Linear –
- Polynomial –
- Radial basis function –
- Sigmoid –
From a practical point of view, only the linear and radial basis function kernels from this list should be considered since the polynomial kernel has too many parameters to optimize and the sigmoid kernel does not satisfy the positive semidefinite kernel matrix requirement of Mercer’s Theorem.
Algorithmic details
The Support Vector Machine classifier can be implemented using a quadratic programming solver or by incremental descent algorithms. Both methods work, but are difficult to implement and expensive to procure. An alternative is the Sequential Minimal Optimization algorithm developed by John Platt at Microsoft Research. The algorithm works by analytically solving the dual problem for the case of two training examples then iterating over all of the lagrange multipliers verifying that the constraints are satisfied. For those that are not, the algorithm computes new lagrange multiplier values. The full details of the algorithm can be found in Platt’s paper.
The time complexity of the algorithm is quadratic with respect to the number of training samples and support vectors .
The time complexity of evaluating the decision function is linear with respect to the number of support vectors .
Multiclass Classification
The classification methods presented in the previous section are utilized as binary classifiers. These classifiers can be used to classify multiple classifications by employing a one-vs-all or all-vs-all approach. In the former a single classification is separated from the remaining classifications to produce classifiers for the
classifications. Each classifier is then used to evaluate a vector and the classifier with the highest confidence is then used to declare the classification.
In the latter, a single classification is compared individually to each other classification resulting in classifiers. All of the classifiers are then evaluated against the test vector and the classification with the greatest consensus from the classifiers is declared the classification of the test vector.
Both methods have their place. The benefit of a one-vs-all approach is that there are fewer classifiers to maintain. However, training a single classifier on a complete data set is time consuming and can give deceptive performance measures. All-vs-all does result in more classifiers, but it also provides for faster training which can be easily parallelized on a single machine and distributed to machines on a network.
Classifier Evaluation
Individual classifiers are evaluated by training the classifier against a data set and then determining how many correct and incorrect classifications were produced. This evaluation produces a confusion matrix.
Predicted Classification | ||||
---|---|---|---|---|
Positive | Negatives | Total | ||
Actual Classification | Positive | (TP) True Positive | (FN) False Negative | (AP) Actual Positives |
Negatives | (FP) False Positive | (TN) True Negative | (AN) Actual Negatives | |
Total | (PP) Predicted Positives | (PN) Predicted Negatives | (N) Examples |
The confusion matrix is used to calculate a number of values which are used to evaluate the performance of the classifier. The first of which is the accuracy and error of the classifier. Accuracy measures the number of instances where the actual and predicted classifications matched up and the error for when they do not.
Since we should expect to get different results each time we evaluate a classifier, the values that we obtain above are sample estimates of the true values that are expected. Given enough trails and measurements, it is possible to determine empirically what the true values actually are. However, this is time consuming and it is instead easier to use confidence intervals to determine what interval of values a measurement is mostly likely to fall into.
Training and Testing
Each of the classifiers presented have some number of parameters that must be determined. The parameters can be selected by having some prior knowledge or by exploring the parameter space and determining which parameters yield optimal performance. This is done by performing a simple grid search over the parameter space and evaluating and attempting to minimize the error.
K-folds cross-validation is used at each grid location to produce a reliable measure of the error. The idea is that a data set is split into disjoint sets. The first set is used as a validation set and the remaining
sets are used in unison as the training data set for the classifier. This process is done on the next set and so on until all
sets have been used as a validation set.
System
Implementation
The system was implemented in C# 4.0 on top of the Microsoft .NET Framework. The user interface was written by hand using the WinForms library. No other third-party libraries or frameworks were used. When possible, all algorithms were parallelized to take advantage of multi-core capabilities to improve processing times.
Summary
The system consists of two modes of operation: training and production. In training, a human classifier labels image segments with an appropriate classification. New image segments are then taken into consideration during the training of machine learning algorithms. Those algorithms producing the lowest error for a given classification are then used in production mode. During production, a user submits an image and each image segment is then evaluated against the available classifiers. Those image segments are then presented to the user with the most likely classification. These two modes along with their workflows and components are illustrated in the following diagram.
![]() |
Training Mode
Data Set Construction
The user interface of the system allows users to add an image segment to a local data set of images. Once added, the image is then processed to yield image segments. The user can then label an image segment by editing the segment and moving on to the next image segment. This allows for easy and efficient human classification of data. If the user does not wish to keep the image, he or she may remove the image from the data set as well.
![]() |
Data Set Cleaning
During the construction phase, errors may be introduced into the data set typically in the case of typos or forgetting which segment was currently being edited. The data set is cleaned by listing out all available classifications and presenting the user with all available segments associated with that classification. The user can then review the image segment as it was identified in the source image. If the user does not wish to keep the classification, he or she may remove the image from the data set as well.
![]() |
Data Set Statistics
The data set consists of 496 comic book covers pulled from the Cover Browser database of comic book covers. The first 62 consecutive published comic book covers where used from Action Comics, Amazing Spider-man, Batman, Captain America, Daredevil, Detective Comics, Superman, and Wonder Woman and then processed by the image processing subsystem yielding 24,369 image segments. 11,463 of these segments represented classifiable segments which were then labeled by hand over the course of two weeks; the remaining segments were then discarded.
![]() |
In total, there were 239 classifications identified in the data set among 18 categories. Text, clothing, geography, and transportation categories accounting for 90% of the data set. Since the majority of classification were incidental, only those classifications having 50 or more image segments were considered by the application leaving a total of 38 classifications.
Classifier Evaluation
For the 38 classifications meeting the minimum criteria for classification, the K-Nearest Neighbor approach worked well in distinguishing between text classifications from other classifications and between intra-text classifications for both all-vs-all and one-vs-all schemes.
![]() |
![]() |
All-vs-All K-Nearest Neighbor Performance. | One-vs-All K-Nearest Neighbor Performance. |
The Support Vector Machine approach presented unremarkable results for both all-vs-all and one-vs-all methods. In the former, only a few pairings resulted in acceptable error rates whereas the later presented only a couple acceptable error rates.
![]() |
![]() |
All-vs-All Support Vector Machine Performance. | One-vs-All Support Vector Machine Performance. |
For both classification methods presented, the all-vs-all method yielded superior results to the one-vs-all method. In comparing the two classifier methods, the K-Nearest Neighbor seems to have done better than the Support Vector Machine approach, contrary to what was expected from literature. Both classifier methods are used in production mode.
Production Mode
Production mode allows the end user to add an image to the data set and then review the most likely classifications produced by evaluating each image segment against the available set of classifiers. The end user is then expected to review each segment and accept or reject the suggested classification. Aside from this additional functionality, production mode is nearly identical in functionality to training mode.
![]() |
Conclusions
The time spent on this project was well spent. I met the objectives that I laid out at the beginning of the project and now have a better understanding of the image processing algorithms and machine learning concepts from a theoretical and practical point of view.
Future Work
Segmentation
One issue with the existing implementation is that it over segments the image. Ideally, fewer segments would be produced that are more closely aligned with their conceptual classification. There are a number of popular alternatives to the approach taken, such as level set methods, which should be further investigated.
Classification
The approach taken to map scaled versions of the image segments to a space is simple to implement, but it did not assist well in the classification process. Alternative mappings such as histogram models should be evaluated in the future to decrease classification times and to determine if classification error rates can be reduced.
System User Interface
While it was disappointing to have spent so much time building a data set only to have to limit what was considered, it assisted me in building a user interface that had to be easy and fast to use. The application can certainly be developed further and adapted to allow for other data sets to be constructed, image segmentation methods to be added and additional classifications to be evaluated.
System Scalability
The system is limited now to a single machine, but to grow and handle more classifications, it would need to be modified to run on multiple machines, have a web-based user interface developed and a capable database to handle the massive amounts of data that would be required to support a data set on the scale of the complete Cover Browser’s or similar sites’ databases (e.g., 450,000 comic book covers scaled linearly would require 546 GiB of storage.) Not to mention data center considerations for overall system availability and scalability.
References
Aly, Mohamed. Survey on Multiclass Classification Methods. [pdf] Rep. Oct. 2011. Caltech. 24 Aug. 2012.
Asmar, Nakhle H. Partial Differential Equations: With Fourier Series and Boundary Value Problems. 2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2004. Print.
Bousquet, Olivier, Stephane Boucheron, and Gabor Lugosi. “Introduction to Statistical Learning Theory.” [pdf] Advanced Lectures on Machine Learning 2003,Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003 (2004): 169-207. 7 July 2012.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization [pdf]. N.p.: Cambridge UP, 2004. Web. 28 June 2012.
Burden, Richard L., and J. Douglas. Faires. Numerical Analysis. 8th ed. Belmont, CA: Thomson Brooks/Cole, 2005. Print.
Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. “An Empirical Evaluation of Supervised Learning in High Dimensions.” [pdf] ICML ’08 Proceedings of the 25th international conference on Machine learning (2008): 96-103. 2 May 2008. 6 June 2012.
Fukunaga, Keinosuke, and Patrenahalli M. Narendra. “A Branch and Bound Algorithm for Computing k-Nearest Neighbors.” [pdf] IEEE Transactions on Computers (1975): 750-53. 9 Jan. 2004. 27 Aug. 2012.
Gerlach, U. H. Linear Mathematics in Infinite Dimensions: Signals, Boundary Value Problems and Special Functions. Beta ed. 09 Dec. 2010. Web. 29 June 2012.
Glynn, Earl F. “Fourier Analysis and Image Processing.” [pdf] Lecture. Bioinformatics Weekly Seminar. 14 Feb. 2007. Web. 29 May 2012.
Gunn, Steve R. “Support Vector Machines for Classification and Regression” [pdf]. Working paper. 10 May 1998. University of Southampton. 6 June 2012.
Hlavac, Vaclav. “Fourier Transform, in 1D and in 2D.” [pdf] Lecture. Czech Technical University in Prague, 6 Mar. 2012. Web. 30 May 2012.
Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. [pdf] Tech. 18 May 2010. National Taiwan University. 6 June 2012.
Kibriya, Ashraf M. and Eibe Frank. “An empirical comparison of exact nearest neighbour algorithms.” [pdf] Proc 11th European Conference on Principles and Practice of Knowledge Discovery in Databases. (2007): 140-51. 27 Aug. 2012.
Marshall, A. D. “Vision Systems.” Vision Systems. Web. 29 May 2012.
Panigraphy, Rina. Nearest Neighbor Search using Kd-trees. [pdf] Tech. 4 Dec. 2006. Stanford University. 27 Aug. 2012.
Pantic, Maja. “Lecture 11-12: Evaluating Hypotheses.” [pdf] Imperial College London. 27 Aug. 2012.
Platt, John C. “Fast Training of Support Vector Machines Using Sequential Minimal Optimization.” [pdf] Advances in Kernel Methods – Support Vector Learning (1999): 185-208. Microsoft Research. Web. 29 June 2012.
Sonka, Milan, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. 2nd ed. CL-Engineering, 1998. 21 Aug. 2000. Web. 29 May 2012.
Szeliski, Richard. Computer vision: Algorithms and applications. London: Springer, 2011. Print.
Tam, Pang-Ning, Michael Steinbach, and Vipin Kumar. “Classification: Basic Concepts, Decision Trees, and Model Evaluation.” [pdf] Introduction to Data Mining. Addison-Wesley, 2005. 145-205. 24 Aug. 2012.
Vajda, Steven. Mathematical programming. Mineola, NY: Dover Publications, 2009. Print.
Welling, Max. “Support Vector Machines“. [pdf] 27 Jan. 2005. University of Toronto. 28 June 2012
Weston, Jason. “Support Vector Machine (and Statistical Learning Theory) Tutorial.” [pdf] Columbia University, New York City. 7 Nov. 2007. 28 June 2012.
Zhang, Hui, Jason E. Fritts, and Sally A. Goldman. “Image Segmentation Evaluation: A Survey of Unsupervised Methods.” [pdf] Computer Vision and Image Understanding 110 (2008): 260-80. 24 Aug. 2012.
Copyright
Images in this post are used under §107(2) Limitations on exclusive rights: Fair use of Chapter 1: Subject Matter and Scope of Copyright of the of the Copyright Act of 1976 of Title 17 of the United States Code.
Tropical Representation of the All-Pairs Shortest Path Problem
Motivation
While I was doing my Abstract Algebra research the other month, I came across an interesting way of simplifying the representation of the all-pairs shortest path problem using Tropical Geometry. I thought it was pretty clever, so I thought I’d do a quick write-up.
Problem Statement
The all-pairs shortest path problem is to identify the minimum path cost, , out of the possible paths
between vertices
and
.
Proposition
Consider a weighted directed graph (digraph), , consisting of vertices,
, and directed edges (arcs),
, and a function,
, yielding the weight of an edge. Only those weights from the positive affinely extended real numbers,
, are allowed per the problem statement. The adjacency matrix representation,
, of
is given by the following matrix:
Now, consider a semi-ring over whose additive operator,
, is given by the minimum function,
, and whose multiplicative operator,
, is given by addition,
. The additive unit is given by infinity,
, and the multiplicative unit by zero,
. This semi-ring is the Tropical Semi-ring
. (The namesake of tropical is in honor of Brazilian Imre Simon who developed this branch of mathematics.)
Linear algebra constructs can be tropicalized to yield the familiar definitions for matrix addition and multiplication for matricies and
.
Given the two prior statements, the elegant solution to the all-pairs shortest path problem is given by taking powers of the adjacency matrix: .
Proof
To see how this works out, start with . The matrix represents the minimum cost between any two adjacent vertices. In other words, the minimum cost for all paths containing a single edge. The next inductive step is to consider paths containing at most two adjacent edges. Squaring the adjacency matrix yields all such paths. When the matrix is squared, each edge is concatenated to all other adjacent edges and the minimum weight of the paths is selected. This thought process can iterated as follows:
The result is a typical Bellman equation. A graph can have at most edges between any two vertices, thus, the solution to the all-pairs shortest path problem is given by
.
Example
As a worked example, consider the following graph whose set of vertices is given by the set , set of arcs by
and weight function,
, as labeled on the graph.
The all-pairs shortest paths are given by the following calculations where the row and column coordinates correspond to the vertices of . Values in bold denote a change in the shortest path between two vertices.
Computational Complexity
From asymptotic standpoint, tropical matrix multiplication is still on the order of traditional matrix multiplication of . Computing the all-pairs shortest path problem using this approach is on the order of
since we must perform the tropical matrix multiplication
times. Now, This can be improved slightly since tropical matrix multiplication is associative, so we can leverage the repeated squaring approach and reduce the time complexity down to
.
The time complexity can be further reduced to using the Floyd-Warshall Algorithm, which is another dynamic programming approach that is similar in form to the tropical representation of the problem. In essence, it follows the same base case, but it’s recurrence statement only considers a range of vertices with respect to the two vertices being considered. A more in depth review of the algorithm can be found in the references.
References
“Floyd-Warshall’s Algorithm.” Algorithmist. Web. 12 Apr. 2012.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. “25.2 The Floyd-Warshall Algorithm.” Introduction to Algorithms. 2nd ed. Cambridge, MA: MIT, 2001. 629-35. Print.
Diestel, Reinhard. Graph theory. Heidelberg New York: Springer, 2010.
Laface, Antonio. Introduction to Tropical Geometry [pdf]. 29 Nov. 2006. Web. 11 Apr. 2012.
Maclagan, Diane, and Bernd Sturmfels. Introduction to Tropical Geometry [pdf]. 4 Nov. 2009. Web. 9 Apr. 2012.
Mohri, Mehryar. “Semiring Frameworks and Algorithms for Shortest-Distance Problems” [pdf]. Journal of Automata, Languages and Combinatorics 7 (2002) 3: 321-50. 8 Aug. 2002. Web. 31 Mar. 2012.