# Antimatroid, The

thoughts on computer science, electronics, mathematics

## k-Means Clustering using CUDAfy.NET

### Introduction

I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation-intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised k-means clustering algorithm using CUDAfy.NET, a C# wrapper for doing parallel computation on CUDA-enabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikit-learn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.

### Algorithms

#### Sequential

The traditional algorithm attributed to [Stu82] begins as follows:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. do (until convergence)
    1. For each point in data set:
        1. labels[point] = Assign(point, centroids)
    2. centroids = Aggregate(points, labels)
    3. convergence = DetermineConvergence()
3. return centroids

Assign labels each point with the label of the nearest centroid, and Aggregate updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign routine. For each of the $N$ points we compute the distance to each of the $K$ centroids and assign the point to the centroid with the shortest distance. This is an instance of the Nearest Neighbor Search problem. Linear search gives $\mathcal{O}( K N )$, which is preferable to using something like k-d trees, which would require repeated superlinear construction and querying. Assuming Euclidean distance and points from $\mathbb{R}^d$, this gives time complexity $\mathcal{O}( d K N )$. The Aggregate routine also takes $\mathcal{O}(d K N)$. Assuming convergence is guaranteed in $I$ iterations, the resulting complexity is $\mathcal{O}(d K N I)$, which lends to an effectively linear algorithm.
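The steps above can be sketched in a few lines of pure Python; this is a minimal illustration of the sequential algorithm (a hypothetical `kmeans_sequential` helper, not the benchmarked implementation), using the squared Euclidean distance and the same relative-change convergence test used throughout this post:

```python
import random

def kmeans_sequential(points, K, max_iter=1000, tol=1e-3):
    # Pick K points at random as the starting centroids.
    centroids = [list(p) for p in random.sample(points, K)]
    labels = [0] * len(points)
    p_change = 0.0
    for _ in range(max_iter):
        # Assign: label each point with the nearest centroid (linear search).
        for i, p in enumerate(points):
            labels[i] = min(range(K),
                            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
        # Aggregate: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(K):
            members = [p for i, p in enumerate(points) if labels[i] == k]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[k])  # keep empty clusters in place
        # Convergence: change in centroid positions relative to last iteration.
        change = sum((a - b) ** 2
                     for c_new, c_old in zip(new_centroids, centroids)
                     for a, b in zip(c_new, c_old))
        centroids = new_centroids
        if change <= tol * p_change:
            break
        p_change = change
    return centroids, labels

# Two natural clusters at .25 and -.25 in each dimension, as in the benchmarks.
cs, ls = kmeans_sequential([(0.25, 0.25)] * 5 + [(-0.25, -0.25)] * 5, 2)
print(sorted(map(tuple, cs)))  # [(-0.25, -0.25), (0.25, 0.25)]
```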

#### Parallel

[LiFa89] was among the first to study several different shared memory parallel algorithms for k-means clustering, and here I will be going with the following one:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. Partition the $N$ points into $P$ equally sized sets.
3. Run to completion threadId from 1 to $P$ as:
    1. do (until convergence)
        1. sum, count = zero($K * d$), zero($K$)
        2. For each point in partition[threadId]:
            1. label = Assign(point, centroids)
            2. For each dim in point:
                1. sum[$d$ * label + dim] += point[dim]
            3. count[label] = count[label] + 1
        3. if(barrier.Synchronize())
            1. centroids = sum / count
            2. convergence = DetermineConvergence()
4. return centroids

The parallel algorithm can be viewed as $P$ smaller instances of the sequential algorithm processing $N/P$ chunks of points in parallel. There are two main departures from the sequential approach: 1) future centroid positions are accumulated and counted after each labeling, and 2) each iteration of the $P$ while loops is synchronized before continuing on to the next iteration using a barrier – a way of ensuring that all threads wait for the last thread to arrive at the barrier, and only once it has arrived do all of the threads exit and continue.
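As an aside, the barrier idiom described above exists directly in some standard libraries; Python's threading.Barrier, for example, takes an action callback that runs exactly once when the last thread arrives, which is analogous to the post-barrier aggregation step in the pseudocode. A minimal sketch (the `worker` function and its partial sums are hypothetical stand-ins):

```python
import threading

P = 4
partials = [0] * P
results = []

# The action callback runs exactly once, in a single thread, after all P
# threads have arrived; this is where the aggregation step would go.
barrier = threading.Barrier(P, action=lambda: results.append(sum(partials)))

def worker(tid):
    partials[tid] = tid + 1  # stand-in for a thread's partial sums/counts
    barrier.wait()           # block until the last thread arrives

threads = [threading.Thread(target=worker, args=(t,)) for t in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [10] -- the aggregate computed exactly once
```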

In terms of time complexity, Assign remains unchanged at $\mathcal{O}(d K)$, and incrementing the sums and counts for a point’s label takes time $\mathcal{O}(d + 1)$. Thus for $N/P$ points, a single iteration of the loop takes $\mathcal{O}( N/P (d K + d + 1) )$ time. Given $P$ threads, the maximum time is set by the last thread to enter the barrier, and assuming at most $I$ iterations, the overall complexity is $\mathcal{O}(d I ( N (K + 1) + K P + 1 ) / P)$, which suggests we should see at most a $\mathcal{O}(K P / (K + 1))$ speedup over the sequential implementation for large values of $N$.

#### GPGPU

The earliest work I found on doing k-means clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:

1. Pick $K$ points at random as the starting centroid of each cluster.
2. Partition the $N$ points into $B$ blocks such that each block contains no more than $T$ points.
3. do (until convergence)
    1. Initialize sums, counts to zero
    2. Process blockId 1 to $B$, $SM$ at a time in parallel on the GPGPU:
        1. Initialize blockSum, blockCounts to zero
        2. Synchronize threads
        3. label = Assign(points[blockId * $T$ + threadId], centroids)
        4. For each dim in points[blockId * $T$ + threadId]:
            1. atomic blockSum[label * pointDim + dim] += points[blockId * $T$ + threadId][dim]
        5. atomic blockCount[label] += 1
        6. Synchronize threads
        7. if threadId == 0
            1. atomic sums += blockSum
            2. atomic counts += blockCounts
    3. centroids = sums / counts
    4. convergence = DetermineConvergence()

The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that process a single “block” at a time. Here we assign no more than $T$ points to a block such that each point runs as a single thread to be executed on each of the CUDA cores of the Streaming Multiprocessor.

When a single block is executing we’ll initialize the running sum and count as we did in the parallel case, then request that the running threads synchronize, then proceed to calculate the label of the point assigned to each thread and atomically update the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically copies those block-level sums and counts over to the global sums and counts shared by all of the blocks.

Let’s figure out the time complexity. A single thread in a block being executed by a Streaming Multiprocessor takes time $\mathcal{O}( 2K + (3K + 1)d + 1 )$. Assuming that all $T$ threads of the block execute in parallel, that there are $B$ blocks, and $S$ Streaming Multiprocessors, the complexity becomes $\mathcal{O}(B / S (2K + (3K + 1)d + 1) )$. Since $B = N / T$, and at most $I$ iterations can go by in parallel, we are left with $\mathcal{O}( I N (2K + (3K + 1)d + 1) / T S )$. So the expected speedup over the sequential algorithm should be $\mathcal{O}( d K T S / (2K + (3K + 1)d + 1) )$.

#### Expected performance

For large values of $N$, if we allow $K$ to be significantly larger than $d$, we should expect the parallel version to be 8x faster than the sequential version and the GPGPU version to be 255x faster than the sequential version, given that $P = 8, S = 2, T = 512$ for the hardware that will be used to conduct the tests. If instead $d$ is significantly larger than $K$, then the parallel bound is the same, and the GPGPU version should be 340x faster than the sequential version. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching that are not incorporated (and difficult to incorporate) into the calculations.
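As a sanity check on the arithmetic, the limiting values of those speedup bounds can be evaluated directly. This sketch uses the hardware parameters $P = 8$, $S = 2$, $T = 512$ from the experimental setup and, for the $K \gg d$ case, assumes $d = 2$ (the point dimension used in the first benchmark); the denominators keep only the dominant terms:

```python
P, S, T = 8, 2, 512

# Parallel bound O(K P / (K + 1)): approaches P for large K.
K = 10**6
parallel_limit = round(K * P / (K + 1.0))

# GPGPU bound O(d K T S / (2K + (3K + 1) d + 1)), keeping dominant terms:
d = 2  # benchmark point dimension (assumption for the K >> d case)
gpgpu_k_large = round(d * T * S / (3.0 * d + 2))  # K >> d: ~ d T S / (3d + 2)
gpgpu_d_large = round(T * S / 3.0)                # d >> K: tends to T S / 3

print(parallel_limit, gpgpu_k_large, gpgpu_d_large)  # 8 256 341
```

These round to the 8x, 255x, and 340x figures quoted above.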

### Implementations

I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikit-learn was configured for testing.

#### C++

The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high level orchestration code. Let’s begin with the barrier implementation. Since I’m running on Windows 7, I’m unable to use the convenient InitializeSynchronizationBarrier, EnterSynchronizationBarrier, and DeleteSynchronizationBarrier API calls introduced in Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:

// ----------------------------------------------------------------------------
// Synchronization utility functions
// ----------------------------------------------------------------------------

struct Barrier {
    CONDITION_VARIABLE conditionVariable;
    CRITICAL_SECTION criticalSection;
    int atBarrier;
    int expectedAtBarrier;
};

void deleteBarrier(Barrier* barrier) {
    DeleteCriticalSection(&(barrier->criticalSection));
    // No API for deleting a condition variable
}

void initializeBarrier(Barrier* barrier, int numThreads) {
    barrier->atBarrier = 0;
    barrier->expectedAtBarrier = numThreads;

    InitializeConditionVariable(&(barrier->conditionVariable));
    InitializeCriticalSection(&(barrier->criticalSection));
}

bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) {
    bool lastToEnter = false;

    EnterCriticalSection(&(barrier->criticalSection));

    ++(barrier->atBarrier);

    if (barrier->atBarrier == barrier->expectedAtBarrier) {
        barrier->atBarrier = 0;
        lastToEnter = true;

        func(data);

        WakeAllConditionVariable(&(barrier->conditionVariable));
    }
    else {
        SleepConditionVariableCS(&(barrier->conditionVariable), &(barrier->criticalSection), INFINITE);
    }

    LeaveCriticalSection(&(barrier->criticalSection));

    return lastToEnter;
}


A Barrier struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.

When a thread arrives at the barrier (synchronizeBarrier) it requests the critical section before attempting to increment the atBarrier variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is signaled. The reason why LeaveCriticalSection is included outside the if statement is because SleepConditionVariableCS will release the critical section before putting the thread to sleep, then reacquire it when the thread awakes. I don’t like that behavior since it’s an unnecessary acquisition of the critical section that slows down the implementation.

There is a single allocation routine which performs a couple of rounds of error checking when calling calloc: first to check if the routine returned null, and second to see if it set a Windows error code that can be inspected with GetLastError. If either event occurs, the application will terminate.

// ----------------------------------------------------------------------------
// Allocation utility functions
// ----------------------------------------------------------------------------

void* checkedCalloc(size_t count, size_t size) {
    SetLastError(NO_ERROR);

    void* result = calloc(count, size);
    DWORD lastError = GetLastError();

    if (result == NULL) {
        fprintf(stdout, "Failed to allocate %Iu bytes. GetLastError() = %lu.", count * size, lastError);
        ExitProcess(EXIT_FAILURE);
    }

    if (result != NULL && lastError != NO_ERROR) {
        fprintf(stdout, "Allocated %Iu bytes. GetLastError() = %lu.", count * size, lastError);
        ExitProcess(EXIT_FAILURE);
    }

    return result;
}


Now on to the core of the implementation. A series of structs are specified for the data that are shared (e.g., points, centroids) among the threads, and the data that are local to each thread (e.g., point boundaries, partial results).

// ----------------------------------------------------------------------------
// Parallel Implementation
// ----------------------------------------------------------------------------

struct LocalAssignData;

struct SharedAssignData {
    Barrier barrier;
    bool continueLoop;

    int numPoints;
    int pointDim;
    int K;

    double* points;
    double* centroids;
    int* labels;

    int maxIter;
    double change;
    double pChange;

    DWORD numProcessors;

    LocalAssignData* local;
};

struct LocalAssignData {
    SharedAssignData* shared;
    int begin;
    int end;

    int* labelCount;
    double* partialCentroids;
};


The assign method does exactly what was specified in the parallel algorithm section. It iterates over the portion of points it is responsible for, computing each point’s label and its partial centroids (the sum of the points with label $k$; the division is done in the aggregate step).

void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) {
    int* local = (int*)checkedCalloc(end - begin, sizeof(int));

    int* localCount = (int*)checkedCalloc(K, sizeof(int));
    double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double));

    // Process a chunk of the array.
    for (int point = begin; point < end; ++point) {
        double optDist = INFINITY;
        int optCentroid = -1;

        for (int centroid = 0; centroid < K; ++centroid) {
            double dist = 0.0;
            for (int dim = 0; dim < pointDim; ++dim) {
                double d = points[point * pointDim + dim] - centroids[centroid * pointDim + dim];
                dist += d * d;
            }

            if (dist < optDist) {
                optDist = dist;
                optCentroid = centroid;
            }
        }

        local[point - begin] = optCentroid;
        ++localCount[optCentroid];

        for (int dim = 0; dim < pointDim; ++dim)
            localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim];
    }

    memcpy(&label[begin], local, sizeof(int) * (end - begin));
    free(local);

    memcpy(labelCount, localCount, sizeof(int) * K);
    free(localCount);

    memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K);
    free(localPartial);
}


One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using it within the assign routine. This was motivated after I read about false sharing, where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU, degrading overall performance. labelCount and partialCentroids are reallocated since I was concerned about data locality and wanted the three arrays to be relatively in the same neighborhood of memory. Speaking of which, memory coalescing is used for the points array so that a point’s dimensions are adjacent in memory to take advantage of caching. Overall, a series of cache-friendly optimizations.
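The coalesced layout mentioned above is just row-major order: point p's dim-th coordinate lives at points[p * pointDim + dim], so a point's $d$ coordinates occupy adjacent memory. A quick illustration (the `coord` helper is hypothetical, mirroring the indexing used throughout the C++ code):

```python
# Three 2-D points stored in a flat, row-major array as in the C++ code.
pointDim = 2
points = [1.0, 2.0,   # point 0
          3.0, 4.0,   # point 1
          5.0, 6.0]   # point 2

def coord(p, dim):
    # Same indexing expression as points[point * pointDim + dim] in C++.
    return points[p * pointDim + dim]

print([coord(1, d) for d in range(pointDim)])  # [3.0, 4.0]
```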

The aggregate routine follows a similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args->shared->local[t].partialCentroids and args->shared->local[t].labelCount. Using these partial results allows the routine to complete in $\mathcal{O}(P K d)$ time, which, assuming all of these parameters are significantly less than $N$, gives an effectively constant time routine. Once the centroids have been updated, the change in their location is computed and used to determine convergence along with how many iterations have gone by. If more than 1,000 iterations have occurred, or the relative change in position is less than some tolerance (0.1%), then the threads will terminate.

void aggregate(void* data) {
    LocalAssignData* args = (LocalAssignData*)data;

    int* assignmentCounts = (int*)checkedCalloc(args->shared->K, sizeof(int));
    double* newCentroids = (double*)checkedCalloc(args->shared->K * args->shared->pointDim, sizeof(double));

    // Compute the assignment counts from the work the threads did.
    for (DWORD t = 0; t < args->shared->numProcessors; ++t)
        for (int k = 0; k < args->shared->K; ++k)
            assignmentCounts[k] += args->shared->local[t].labelCount[k];

    // Compute the location of the new centroids based on the work that the
    // threads did.
    for (DWORD t = 0; t < args->shared->numProcessors; ++t)
        for (int k = 0; k < args->shared->K; ++k)
            for (int dim = 0; dim < args->shared->pointDim; ++dim)
                newCentroids[k * args->shared->pointDim + dim] += args->shared->local[t].partialCentroids[k * args->shared->pointDim + dim];

    for (int k = 0; k < args->shared->K; ++k)
        for (int dim = 0; dim < args->shared->pointDim; ++dim)
            newCentroids[k * args->shared->pointDim + dim] /= assignmentCounts[k];

    // See how much the positions of the centroids changed.
    args->shared->change = 0.0;
    for (int k = 0; k < args->shared->K; ++k)
        for (int dim = 0; dim < args->shared->pointDim; ++dim) {
            double d = args->shared->centroids[k * args->shared->pointDim + dim] - newCentroids[k * args->shared->pointDim + dim];
            args->shared->change += d * d;
        }

    // Store the new centroid locations into the centroid output.
    memcpy(args->shared->centroids, newCentroids, sizeof(double) * args->shared->pointDim * args->shared->K);

    // Decide if the loop should continue or terminate. (Max iterations
    // exceeded, or relative change not exceeded.)
    args->shared->continueLoop = args->shared->change > 0.001 * args->shared->pChange && --(args->shared->maxIter) > 0;

    args->shared->pChange = args->shared->change;

    free(assignmentCounts);
    free(newCentroids);
}


Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.

DWORD WINAPI assignThread(LPVOID data) {
    LocalAssignData* args = (LocalAssignData*)data;

    while (args->shared->continueLoop) {
        memset(args->labelCount, 0, sizeof(int) * args->shared->K);

        // Assign points cluster labels
        assign(args->shared->labels, args->begin, args->end, args->labelCount, args->shared->K, args->shared->points, args->shared->pointDim, args->shared->centroids, args->partialCentroids);

        // Tell the last thread to enter here to aggregate the data within a
        // critical section
        synchronizeBarrier(&(args->shared->barrier), aggregate, args);
    }

    return 0;
}


The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance while the remaining $P - 1$ instances run on separate threads. The data is partitioned, then the threads are spawned using the CreateThread routine. I wish there was a Windows API that would allow me to simultaneously create $P$ threads with a specified array of arguments, because CreateThread will automatically start the thread as soon as it’s created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.

void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) {
    // Look up and calculate all the threading related values.
    SYSTEM_INFO systemInfo;
    GetSystemInfo(&systemInfo);

    DWORD numProcessors = systemInfo.dwNumberOfProcessors;
    DWORD numThreads = numProcessors - 1;
    DWORD pointsPerProcessor = numPoints / numProcessors;

    // Prepare the shared arguments that will get passed to each thread.
    SharedAssignData shared;
    shared.numPoints = numPoints;
    shared.pointDim = pointDim;
    shared.K = K;
    shared.points = points;

    shared.continueLoop = true;
    shared.maxIter = 1000;
    shared.pChange = 0.0;
    shared.change = 0.0;
    shared.numProcessors = numProcessors;

    initializeBarrier(&(shared.barrier), numProcessors);

    shared.centroids = centroids;
    for (int i = 0; i < K; ++i) {
        int point = rand() % numPoints;
        for (int dim = 0; dim < pointDim; ++dim)
            shared.centroids[i * pointDim + dim] = points[point * pointDim + dim];
    }

    shared.labels = (int*)checkedCalloc(numPoints, sizeof(int));

    LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData));
    for (DWORD i = 0; i < numProcessors; ++i) {
        local[i].shared = &shared;
        local[i].begin = i * pointsPerProcessor;
        local[i].end = (i + 1 == numProcessors) ? numPoints : (i + 1) * pointsPerProcessor;
        local[i].labelCount = (int*)checkedCalloc(K, sizeof(int));
        local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double));
    }

    shared.local = local;

    HANDLE* threads = (HANDLE*)checkedCalloc(numThreads, sizeof(HANDLE));
    for (DWORD i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, assignThread, &local[i + 1], 0, NULL);

    // Do work on this thread so that it's not just sitting here idle while the
    // other threads are doing work.
    assignThread(&local[0]);

    // Clean up
    for (DWORD i = 0; i < numThreads; ++i) {
        WaitForSingleObject(threads[i], INFINITE);
        CloseHandle(threads[i]);
    }
    free(threads);

    for (DWORD i = 0; i < numProcessors; ++i) {
        free(local[i].labelCount);
        free(local[i].partialCentroids);
    }

    free(local);

    free(shared.labels);

    deleteBarrier(&(shared.barrier));
}


#### C#

The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.

In the GPGPU paradigm there is a host and a device: sequential operations take place on the host (i.e., managed C# code) and parallel operations on the device (i.e., CUDA code). To delineate between the two, the [Cudafy] method attribute is used on the static public method assign. The set of host operations are all within the Fit routine.

Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block, and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and motivated by Thomas Bradley’s Advanced CUDA Optimization workshop notes [pdf], which suggest that at this regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) having shared memory that the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.
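The partitioning described above amounts to a pair of ceiling divisions: one thread per point determines the block count, and blocks beyond the device's x-dimension grid limit fold into a second grid dimension. A small sketch (the `partition` helper is hypothetical, mirroring the arithmetic used later in the Fit routine):

```python
def partition(numPoints, threadsPerBlock, maxGridX):
    # One thread per point: ceiling division gives the number of blocks...
    numBlocks = (numPoints + threadsPerBlock - 1) // threadsPerBlock
    # ...then fold the blocks into a 2-D grid if they exceed the x limit.
    gridX = min(maxGridX, numBlocks)
    gridY = (numBlocks + maxGridX - 1) // maxGridX
    return numBlocks, (gridX, gridY)

# 100,000 points with 512 threads per block and a 65,535-block x limit.
print(partition(100000, 512, 65535))  # (196, (196, 1))
```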

What is nice about the shared memory is that it is much faster than the global memory of the GPGPU. (cf. Using Shared Memory in CUDA C/C++) To make use of this fact the threads will rely on two arrays in shared memory: the sum of the points belonging to each centroid, and the count of those points. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will add the arrays in shared memory to those in the global memory using the appropriate atomic operations.

using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
using Cudafy.Atomics;
using System;

namespace CUDAfyTesting {
    public class CUDAfyKMeans {
        [Cudafy]
        public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) {
            // Unpack the const value array
            int pointDim = constValues[0];
            int K = constValues[1];
            int numPoints = constValues[2];

            int tId = thread.threadIdx.x;
            int blockId = thread.blockIdx.y * thread.gridDim.x + thread.blockIdx.x;
            int point = blockId * thread.blockDim.x + tId;

            // Ensure that the point is within the boundaries of the points
            // array.
            if (point >= numPoints)
                return;

            // Use two shared arrays since they are much faster than global
            // memory. The shared arrays will be scoped to the block that this
            // thread belongs to.
            float[] sharedSums = thread.AllocateShared<float>("sharedSums", 256);
            int[] sharedCounts = thread.AllocateShared<int>("sharedCounts", 128);

            // Accumulate the each point's dimension assigned to the k'th
            // centroid. When K = 128 => pointDim = 2; when pointDim = 128
            // => K = 2; Thus max(len(sharedSums)) = 256.
            if (tId < K * pointDim)
                sharedSums[tId] = 0.0f;

            // Keep track of how many times the k'th centroid has been assigned
            // to a point. max(K) = 128
            if (tId < K)
                sharedCounts[tId] = 0;

            // Make sure all threads share the same shared state before doing
            // any calculations.
            thread.SyncThreads();

            // Find the optCentroid for point.
            double optDist = double.PositiveInfinity;
            int optCentroid = -1;

            for (int centroid = 0; centroid < K; ++centroid) {
                double dist = 0.0;
                for (int dim = 0; dim < pointDim; ++dim) {
                    double d = centroids[centroid * pointDim + dim] - points[point * pointDim + dim];
                    dist += d * d;
                }

                if (dist < optDist) {
                    optDist = dist;
                    optCentroid = centroid;
                }
            }

            // Add the point to the optCentroid sum
            for (int dim = 0; dim < pointDim; ++dim)
                // CUDA doesn't support double precision atomicAdd so cast down
                // to float...
                thread.atomicAdd(ref sharedSums[optCentroid * pointDim + dim], (float)points[point * pointDim + dim]);

            // Increment the optCentroid count
            thread.atomicAdd(ref sharedCounts[optCentroid], 1);

            // Wait for all of the threads to complete populating the shared
            // memory before storing the results back to global memory where
            // the host can access the results.
            thread.SyncThreads();

            // Have to do a lock on both of these since some other Streaming
            // Multiprocessor could be running and attempting to update the
            // values at the same time.

            // Copy the shared sums to the output sums
            if (tId == 0)
                for (int i = 0; i < K * pointDim; ++i)
                    thread.atomicAdd(ref outputSums[i], sharedSums[i]);

            // Copy the shared counts to the output counts
            if (tId == 0)
                for (int i = 0; i < K; i++)
                    thread.atomicAdd(ref outputCounts[i], sharedCounts[i]);
        }


Before going on to the Fit method, let’s look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDA-enabled GPGPU. Within the CUDAfy.Translator namespace there are a handful of classes for decompiling the application into an abstract syntax tree using ICSharpCode.Decompiler and Mono.Cecil, converting the AST over to CUDA C via the visitor pattern, and then compiling the resulting CUDA C using NVIDIA’s NVCC compiler. Finally, the compilation result is relayed back to the caller if there’s a problem; otherwise, a CudafyModule instance is returned, and the compiled CUDA C code it represents is loaded up on the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy, CudaLanguage.RunTransformsAndGenerateCode, CUDAAstBuilder.GenerateCode, CUDAOutputVisitor and CudafyModule.Compile.)

        private CudafyModule cudafyModule;
        private GPGPU gpgpu;
        private GPGPUProperties properties;
        private int threadsPerBlock;

        public int PointDim { get; private set; }
        public double[] Centroids { get; private set; }

        public CUDAfyKMeans() {
            cudafyModule = CudafyTranslator.Cudafy();

            gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            properties = gpgpu.GetDeviceProperties(true);

            gpgpu.LoadModule(cudafyModule);

            // Half the maximum threads per block, per the partitioning
            // discussion above.
            threadsPerBlock = properties.MaxThreadsPerBlock / 2;
        }


The Fit method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive, and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory: essentially, copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points array since it’s the largest resource, but did not see noticeable gains, so I left it as is.

At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set method, then the CUDA code is invoked using the Launch routine with the specified block and grid dimensions and device pointers. One thing that the API does is return an array when you allocate or copy memory over to the device; personally, an IntPtr seems more appropriate. Execution of the routine is very quick: on some of my tests it took 1 to 4 ms to process 100,000 two-dimensional points. Once the routine returns, memory from the device (sums and counts) is copied back over to the host, which then does a quick operation to derive the new centroid locations and copies that memory over to the device for the next iteration.

        public void Fit(double[] points, int pointDim, int K) {
            if (K <= 0)
                throw new ArgumentOutOfRangeException("K", "Must be greater than zero.");

            if (pointDim <= 0)
                throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero.");

            if (points.Length < pointDim)
                throw new ArgumentOutOfRangeException("points", "Must have at least pointDim entries.");

            if (points.Length % pointDim != 0)
                throw new ArgumentException("points.Length must be n * pointDim > 0.");

            int numPoints = points.Length / pointDim;

            // Figure out the partitioning of the data.
            int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0);

            dim3 blockSize = new dim3(threadsPerBlock, 1, 1);

            dim3 gridSize = new dim3(
                Math.Min(properties.MaxGridSize.x, numBlocks),
                Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)),
                1
            );

            int[] constValues = new int[] { pointDim, K, numPoints };
            float[] assignmentSums = new float[pointDim * K];
            int[] assignmentCount = new int[K];

            // Initial centroid locations picked at random
            Random prng = new Random();
            double[] centroids = new double[K * pointDim];
            for (int centroid = 0; centroid < K; centroid++) {
                int point = prng.Next(numPoints);
                for (int dim = 0; dim < pointDim; dim++)
                    centroids[centroid * pointDim + dim] = points[point * pointDim + dim];
            }

            // These arrays are only read from on the GPU- they are never
            // written to on the GPU.
            int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues);
            double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids);
            double[] devicePoints = gpgpu.CopyToDevice<double>(points);

            // These arrays are written to on the GPU.
            float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums);
            int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount);

            // Set up the main loop so that no more than maxIter iterations
            // take place, and so that a relative change of less than 1% in
            // centroid positions will terminate the loop.
            int maxIter = 1000;
            double change = 0.0, pChange = 0.0;

            do {
                pChange = change;

                // Clear out the assignment sums and counts on the GPU.
                gpgpu.Set(deviceSums);
                gpgpu.Set(deviceCount);

                // Launch the GPU portion
                gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount);

                // Copy the result memory from the GPU over to the CPU.
                gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums);
                gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount);

                // Compute the new centroid locations.
                double[] newCentroids = new double[centroids.Length];
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim)
                        newCentroids[centroid * pointDim + dim] = assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid];

                // Calculate how much the centroids have changed to decide
                // whether or not to terminate the loop.
                change = 0.0;
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim) {
                        double d = newCentroids[centroid * pointDim + dim] - centroids[centroid * pointDim + dim];
                        change += d * d;
                    }

                // Update centroid locations on CPU & GPU
                Array.Copy(newCentroids, centroids, newCentroids.Length);
                deviceCentroids = gpgpu.CopyToDevice<double>(centroids);

            } while (change > 0.01 * pChange && --maxIter > 0);

            gpgpu.FreeAll();

            this.Centroids = centroids;
            this.PointDim = pointDim;
        }
    }
}


#### Python

I include the Python implementation for the sake of demonstrating how scikit-learn was invoked throughout the following experiments section.

model = KMeans(
    n_clusters=numClusters,
    init='random',
    n_init=1,
    max_iter=1000,
    tol=1e-3,
    precompute_distances=False,
    verbose=0,
    copy_x=False,
)

model.fit(X)  # X = (numPoints, pointDim) numpy array.


### Experimental Setup

All experiments were conducted on a laptop with an Intel Core i7-2630QM processor and an NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. The C++ and C# implementations were developed and compiled with Microsoft Visual Studio Express 2013 for Desktop targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). The Python implementation was developed using Eclipse Luna 4.4.1 targeting Python 2.7, scikit-learn 0.16.0, and numpy 1.9.1. All compilers used default arguments and no extra optimization flags.

For each test, the reported data point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is measured as the wall time taken to execute model.fit(points, pointDim, numClusters), where time is measured by QueryPerformanceCounter in C++, System.Diagnostics.Stopwatch in C#, and time.clock in Python. Every test is based on a dataset having two natural clusters at .25 or -.25 in each dimension.
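In Python, the measurement procedure can be sketched as follows (a hypothetical `measure` helper; time.perf_counter stands in here for the per-language timers listed above, since time.clock was removed from Python 3):

```python
import time

def measure(fit, samples=30):
    # Median of thirty wall-clock samples, as reported in the results below.
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        fit()
        times.append(time.perf_counter() - start)
    times.sort()
    return times[samples // 2]

elapsed = measure(lambda: sum(range(10000)))  # stand-in for model.fit(...)
print(elapsed >= 0.0)  # True
```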

### Results

#### Varying point quantity

Both the C++ and C# sequential and parallel implementations outperform the Python scikit-learn implementations. However, the C++ sequential and parallel implementations outperform their C# counterparts. The C++ sequential and parallel implementations are effectively tied; it seems the overhead associated with multithreading cancels out any performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but it does outperform the C# sequential one as the number of points to cluster increases.

So what’s the deal with Python scikit-learn? Why is the parallel version so slow? Well, it turns out I misunderstood the n_jobs parameter. I interpreted it to mean that the process of clustering a single set of points would be done in parallel; it actually controls how many runs of the whole process occur simultaneously. I was tipped off to this when I noticed multiple python.exe processes being forked, which surprised me enough (who would implement a parallel routine that way?) to prompt a more thorough reading of the scikit-learn documentation. There is parallelism going on in scikit-learn, just not the desired kind. Taking that into account, the sequential version performs reasonably well for a dynamically typed, interpreted language.

#### Varying point dimension

The C++ and C# parallel implementations exhibit consistently better run times than their sequential counterparts. In all cases the performance is better than scikit-learn’s. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why don’t we see better CUDAfy.NET performance? The performance we see is identical to the varying point quantity test. So on one hand it’s nice that increasing the point dimension did not dramatically increase the run time, but ideally the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device, which is a relatively slow process. Since I’m short on time, this will have to be something I investigate in more detail in the future.
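To put rough numbers on that theory, here is a back-of-the-envelope sketch, assumptions rather than measurements. The per-iteration centroid copy is grounded in the CopyToDevice call in the implementation above; the assumption that the full point set is uploaded once before the loop (as is typical) is mine.

```python
def centroid_transfer_bytes(num_clusters, point_dim, bytes_per_double=8):
    # Host -> device traffic per iteration from copying the updated
    # centroids: one double per (cluster, dimension) pair, so the cost
    # grows linearly with the point dimension.
    return num_clusters * point_dim * bytes_per_double

def point_transfer_bytes(num_points, point_dim, bytes_per_double=8):
    # One-time host -> device copy of the dataset itself (assuming the
    # points are uploaded once before the loop).
    return num_points * point_dim * bytes_per_double
```

For example, 10,000 points in 64 dimensions amount to an up-front copy of about 5 MB, while each per-iteration centroid copy for two clusters is only 1 KB; transfer latency, rather than raw volume, may therefore be the dominant per-iteration cost, which is part of what I’d want to measure.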

#### Varying cluster quantity

As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikit-learn implementation starts to show some competitive performance. The exciting news, of course, is that varying the cluster quantity finally reveals improved C# CUDAfy.NET run time. Now, there is some curious behavior at the beginning of each plot. We get $\le 10 \text{ ms}$ performance for two clusters, then jump up to the $\le 100 \text{ ms}$ range for four to eight clusters. The number of points and their dimension are held constant, and we only allocate a few extra doubles for the cluster centroids. I believe this has to do with cache behavior. I’m assuming that for fewer than four clusters everything that’s needed sits nicely in the fast L1 cache, and that moving up to four or more clusters requires more exchanging of data between the L1, L2, and L3 caches and (slower) main memory across the cores of the Intel Core i7-2630QM processor I’m using. As before, I’ll need to do some more tests to verify that this is what is truly happening.
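One way to start checking the cache hypothesis is to compute the hot working set for each cluster quantity and compare it against per-core cache sizes. The sketch below is only an aid for that comparison under my own assumptions (8-byte doubles, 4-byte ints, and the centroid-related arrays from the implementations above), not a measurement.

```python
def hot_working_set_bytes(num_clusters, point_dim):
    # The centroid-related arrays are touched for every point during
    # the Assign/Aggregate passes, so they benefit most from staying
    # cache-resident: current centroids, running sums, and counts.
    centroids = num_clusters * point_dim * 8
    sums = num_clusters * point_dim * 8
    counts = num_clusters * 4
    return centroids + sums + counts
```

Evaluating this for each cluster quantity and comparing against, say, a 32 KB L1 data cache (a typical per-core size for processors of this era) would show whether the jump between two and four clusters lines up with a cache boundary.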

#### Language comparison

For the three tests considered, the C++ implementations gave the best run time on the point quantity and point dimension tests, while the C# CUDAfy.NET implementation gave the best run time on the cluster quantity test.

The C++ implementation could be made to run faster by preallocating memory in the same fashion that C# does. When a C# application starts, a block of memory is allocated for the managed heap. As a result, allocating a reference type in C# amounts to incrementing a pointer rather than performing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before the C# routines execute, while the equivalent allocation takes place during the C++ routines. Hence, the C++ run times carry an overhead not present in the C# run times. Had I implemented memory allocation in C++ the same way it’s done in C#, the C++ implementation would undoubtedly be even faster than the C# ones.

While using scikit-learn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance, frequently coming in ten times slower than the other two implementations on the varying point quantity and dimension tests, though within tolerance on the varying cluster quantity test.

### Future Work

The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.

I’d like to spend more time figuring out some of the high-performance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than the week or two I spent on this. In addition, I’d like to dig a little deeper into writing CUDA C directly rather than going through the convenient CUDAfy.NET wrapper, as well as explore OpenMP and OpenCL to see how they compare with CUDA from a development and performance-oriented point of view.

Python and scikit-learn were used as the baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter, since Julia pitches itself as a high-performance solution that is also used for exploratory data analysis and prototyping machine learning systems.

While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I’d like to apply CUDAfy.NET to the expectation maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of k-means clustering, and it will be good to implement the more involved EM algorithm.

### Conclusions

Through this exercise, we can expect to see modest speedups over the sequential implementations, about 2.62x and 11.69x for the C# parallel and GPGPU implementations respectively, when finding large numbers of clusters in low-dimensional data. Fortunately, the way k-means clustering is used in practice is to find the cluster quantity that optimizes the Bayesian information criterion or Akaike information criterion, which amounts to running the varying cluster quantity test on real data. On the other hand, most machine learning data is high dimensional, so further testing on a real dataset would be needed to verify its effectiveness in a production environment. Nonetheless, we’ve seen how parallel and GPGPU-based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.
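For reference, one common formulation of the criterion for k-means model selection can be sketched as follows. This is the spherical-Gaussian BIC stated so that lower is better (the likelihood-based form that gets maximized is its negation up to constants); the choice of formula is my assumption, not something used in the experiments above.

```python
import math

def kmeans_bic(rss, num_points, num_clusters, point_dim):
    # Schwarz-style BIC under a spherical Gaussian noise assumption:
    # a goodness-of-fit term plus a complexity penalty that grows with
    # the number of free centroid parameters (k * d). Lower is better.
    n, k, d = num_points, num_clusters, point_dim
    return n * math.log(rss / n) + k * d * math.log(n)
```

Sweeping the cluster quantity over real data and keeping the value with the best BIC is exactly the varying cluster quantity test in disguise, where `rss` would be the within-cluster sum of squared distances from each fit.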

### Bibliography

[LiFa89] Xiaobo Li and Zhixi Fang. “Parallel clustering algorithms.” Parallel Computing, 11(3):275-290, 1989.

[MaMi09] Mario Zechner and Michael Granitzer. “Accelerating K-Means on the Graphics Processor via CUDA.” First International Conference on Intensive Applications and Services (INTENSIVE ’09), pp. 7-15, 2009.

[Stu82] Stuart P. Lloyd. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory, 28:129-137, 1982.