Antimatroid, The

thoughts on computer science, electronics, mathematics

k-Means Clustering using CUDAfy.NET

Introduction

I’ve been wanting to learn how to utilize general purpose graphics processing units (GPGPUs) to speed up computation-intensive machine learning algorithms, so I took some time to test the waters by implementing a parallelized version of the unsupervised k-means clustering algorithm using CUDAfy.NET, a C# wrapper for doing parallel computation on CUDA-enabled GPGPUs. I’ve also implemented sequential and parallel versions of the algorithm in C++ (Windows API), C# (.NET, CUDAfy.NET), and Python (scikit-learn, numpy) to illustrate the relative merits of each technology and paradigm on three separate benchmarks: varying point quantity, point dimension, and cluster quantity. I’ll cover the results, and along the way talk about performance and development considerations of the three technologies before wrapping up with how I’d like to utilize the GPGPU on more involved machine learning algorithms in the future.

Algorithms

Sequential

The traditional algorithm attributed to [Stu82] begins as follows:

  1. Pick K points at random as the starting centroid of each cluster.
  2. do (until convergence)
    1. For each point in data set:
      1. labels[point] = Assign(point, centroids)
    2. centroids = Aggregate(points, labels)
    3. convergence = DetermineConvergence()
  3. return centroids

Assign labels each point with the label of the nearest centroid, and Aggregate updates the positions of the centroids based on the new point assignments. In terms of complexity, let’s start with the Assign routine. For each of the N points we’ll compute the distance to each of the K centroids and assign the point to the centroid with the shortest distance. This is an instance of the Nearest Neighbor Search problem. Linear search gives \mathcal{O}( K N ) distance computations, which is preferable to using something like k-d trees, which would require repeated superlinear construction and querying as the centroids move. Assuming Euclidean distance and points from \mathbb{R}^d, this gives time complexity \mathcal{O}( d K N ). The Aggregate routine will take \mathcal{O}(d K N). Assuming convergence is guaranteed in I iterations, the resulting complexity is \mathcal{O}(d K N I), which is effectively linear in N when d, K, and I are small relative to N.
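
Since the sequential implementation is not listed in this post, here is a minimal sketch of the loop above in standard C++. The names assignLabel, kMeansStep, and kMeansFit are hypothetical, the caller supplies the starting centroids instead of picking them at random, and the sums are accumulated during the labeling pass (as the parallel variant does) rather than in a separate Aggregate scan. Leaving an empty cluster's centroid where it was is an assumption of this sketch, not something the algorithm prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Index of the centroid nearest to `point` (squared Euclidean distance).
// `centroids` holds K rows of d values stored contiguously.
std::size_t assignLabel(const double* point, const std::vector<double>& centroids, std::size_t d) {
    assert(d > 0 && centroids.size() % d == 0);
    const std::size_t K = centroids.size() / d;
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < K; ++k) {
        double dist = 0.0;
        for (std::size_t j = 0; j < d; ++j) {
            const double diff = point[j] - centroids[k * d + j];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = k; }
    }
    return best;
}

// One Assign + Aggregate pass. Updates `centroids` in place and returns the
// total squared distance the centroids moved.
double kMeansStep(const std::vector<double>& points, std::size_t d, std::vector<double>& centroids) {
    const std::size_t N = points.size() / d;
    const std::size_t K = centroids.size() / d;
    std::vector<double> sums(K * d, 0.0);
    std::vector<std::size_t> counts(K, 0);

    for (std::size_t i = 0; i < N; ++i) {
        const std::size_t label = assignLabel(&points[i * d], centroids, d);
        ++counts[label];
        for (std::size_t j = 0; j < d; ++j) sums[label * d + j] += points[i * d + j];
    }

    double change = 0.0;
    for (std::size_t k = 0; k < K; ++k) {
        if (counts[k] == 0) continue;  // assumption: an empty cluster keeps its centroid
        for (std::size_t j = 0; j < d; ++j) {
            const double updated = sums[k * d + j] / counts[k];
            const double diff = updated - centroids[k * d + j];
            change += diff * diff;
            centroids[k * d + j] = updated;
        }
    }
    return change;
}

// Repeats kMeansStep until the centroids stop moving (within `tol`) or
// `maxIter` passes have run. Returns the number of passes performed.
int kMeansFit(const std::vector<double>& points, std::size_t d, std::vector<double>& centroids,
              int maxIter = 1000, double tol = 1e-12) {
    int iter = 0;
    while (iter < maxIter) {
        ++iter;
        if (kMeansStep(points, d, centroids) <= tol) break;
    }
    return iter;
}
```

Calling kMeansFit on the one-dimensional points {0, 1, 10, 11} with starting centroids {0, 10} moves the centroids to the two cluster means on the first pass, and the second pass detects that nothing changed.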

Parallel

[LiFa89] was among the first to study several different shared memory parallel algorithms for k-means clustering, and here I will be going with the following one:

  1. Pick K points at random as the starting centroid of each cluster.
  2. Partition N points into P equally sized sets
  3. Run to completion threadId from 1 to P as:
    1. do (until convergence)
      1. sum, count = zero(K * d), zero(K)
      2. For each point in partition[threadId]:
        1. label = Assign(point, centroids)
        2. For each dim in point:
          1. sum[d * label + dim] += point[dim]
        3. count[label] = count[label] + 1
      3. if(barrier.Synchronize())
        1. centroids = sum / count
        2. convergence = DetermineConvergence()
  4. return centroids

The parallel algorithm can be viewed as P smaller instances of the sequential algorithm processing N/P chunks of points in parallel. There are two main departures from the sequential approach: 1) future centroid positions are accumulated and counted after each labeling, and 2) each iteration of the P while loops is synchronized before continuing on to the next iteration using a barrier, a way of making every thread wait until the last thread arrives. The last thread to enter the barrier performs the centroid update and then releases the waiting threads.

In terms of time complexity, Assign remains unchanged at \mathcal{O}(d K) per point, and incrementing the sums and counts for the point’s label takes time \mathcal{O}(d + 1). Thus for N/P points, a single iteration of the loop gives \mathcal{O}( N/P (d K + d + 1) ) time. Given P threads, the maximum time is determined by the last thread to enter the barrier, and assuming at most I iterations, the overall complexity is \mathcal{O}(d I ( N (K + 1) + K P + 1 ) / P). This suggests we should see at most a \mathcal{O}(K P / (K + 1)) speedup over the sequential implementation for large values of N.
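
The speedup bound comes from dividing the sequential cost by the parallel cost and letting N grow. Written out with the same symbols:

```latex
\frac{d K N I}{d I \left( N (K + 1) + K P + 1 \right) / P}
  = \frac{K N P}{N (K + 1) + K P + 1}
  \xrightarrow{\; N \to \infty \;} \frac{K P}{K + 1}
```

As an illustrative example (values assumed here), K = 128 and P = 8 give 1024 / 129, or roughly 7.9, so the bound sits just under P whenever K is large.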

GPGPU

The earliest work I found on doing k-means clustering on NVIDIA hardware in the academic literature was [MaMi09]. The following is based on that work, and the work I did above on the parallel algorithm:

  1. Pick K points at random as the starting centroid of each cluster.
  2. Partition N points into B blocks such that each block contains no more than T points
  3. do (until convergence)
    1. Initialize sums, counts to zero
    2. Process blockId 1 to B, SM at a time in parallel on the GPGPU:
      1. If threadId == 0
        1. Initialize blockSums, blockCounts to zero
      2. Synchronize Threads
      3. label = Assign(points[blockId * T + threadId], centroids)
      4. For each dim in points[blockId * T + threadId]:
        1. atomic blockSums[label * pointDim + dim] += points[blockId * T + threadId][dim]
      5. atomic blockCounts[label] += 1
      6. Synchronize Threads
      7. If threadId == 0
        1. atomic sums += blockSums
        2. atomic counts += blockCounts
    3. centroids = sums / counts
    4. convergence = DetermineConvergence()

The initialization phase is similar to the parallel algorithm, although now we need to take into account the way that the GPGPU will process data. There are a handful of Streaming Multiprocessors on the GPGPU that process a single “block” at a time. Here we assign no more than T points to a block such that each point runs as a single thread to be executed on each of the CUDA cores of the Streaming Multiprocessor.

When a single block is executing we’ll initialize the running sum and count as we did in the parallel case, then request that the running threads synchronize. Each thread then calculates the label of the point assigned to it and atomically updates the running sum and count. The threads must then synchronize again, and this time only the very first thread atomically adds those block-level sums and counts to the global sums and counts shared by all of the blocks.
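
To make the two-level accumulation concrete, here is a rough CPU-only emulation in standard C++. It is a sketch of the data flow, not device code. Accumulators and accumulateByBlocks are hypothetical names, and the labels are assumed to have been computed already. Each block fills private accumulators (standing in for shared memory) and then merges them into the global ones (standing in for the work done by thread 0).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Accumulators {
    std::vector<double> sums;  // K * d running sums
    std::vector<int> counts;   // K assignment counts
};

// Emulates the block structure on the CPU: each block of at most T points
// fills private accumulators, which are then merged into the global ones.
// labels[i] is the already-computed nearest centroid of point i.
Accumulators accumulateByBlocks(const std::vector<double>& points, const std::vector<int>& labels,
                                std::size_t d, std::size_t K, std::size_t T) {
    assert(d > 0 && T > 0 && points.size() == labels.size() * d);
    const std::size_t N = labels.size();
    Accumulators global{std::vector<double>(K * d, 0.0), std::vector<int>(K, 0)};

    for (std::size_t blockStart = 0; blockStart < N; blockStart += T) {
        const std::size_t blockEnd = std::min(blockStart + T, N);

        // Stand-in for the block's shared memory, zeroed before use.
        Accumulators block{std::vector<double>(K * d, 0.0), std::vector<int>(K, 0)};

        // On the device each iteration would be its own thread and these
        // updates would have to be atomic.
        for (std::size_t i = blockStart; i < blockEnd; ++i) {
            const std::size_t label = static_cast<std::size_t>(labels[i]);
            assert(label < K);
            for (std::size_t j = 0; j < d; ++j) block.sums[label * d + j] += points[i * d + j];
            ++block.counts[label];
        }

        // Stand-in for thread 0 merging the block results into global memory.
        for (std::size_t k = 0; k < K * d; ++k) global.sums[k] += block.sums[k];
        for (std::size_t k = 0; k < K; ++k) global.counts[k] += block.counts[k];
    }
    return global;
}
```

The result does not depend on the block size T; the blocks only change how many updates contend for the same accumulator at once.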

Let’s figure out the time complexity. A single thread in a block being executed by a Streaming Multiprocessor takes time \mathcal{O}( 2K + (3K + 1)d + 1 ). Assuming that all T threads of the block execute in parallel, that there are B blocks, and that there are S Streaming Multiprocessors, the complexity becomes \mathcal{O}(B / S (2K + (3K + 1)d + 1) ). Since B = N / T, and at most I iterations take place, we are left with \mathcal{O}( I N (2K + (3K + 1)d + 1) / (T S) ). So the expected speedup over the sequential algorithm should be \mathcal{O}( d K T S / (2K + (3K + 1)d + 1) ).

Expected performance

For large values of N, if we allow K to be significantly larger than d, we should expect the parallel version to be 8x faster than the sequential version and the GPGPU version to be 255x faster than the sequential version, given that P = 8, S = 2, T = 512 for the hardware that will be used to conduct tests. For d significantly larger than K, the parallel estimate is the same, and the GPGPU version should be 340x faster than the sequential version. Now, it’s very important to point out that these are upper bounds. It is most likely that observed speedups will be significantly less due to technical issues like memory allocation, synchronization, and caching that are not incorporated (and difficult to incorporate) into the calculations.
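
For readers who want to plug in their own hardware numbers, the two bounds derived earlier can be evaluated directly. This is a small sketch with hypothetical function names; it does nothing more than encode the formulas \mathcal{O}(K P / (K + 1)) and \mathcal{O}( d K T S / (2K + (3K + 1)d + 1) ).

```cpp
#include <cassert>

// Upper bound on the multithreaded speedup for large N: K P / (K + 1).
double parallelSpeedupBound(double K, double P) {
    assert(K > 0 && P > 0);
    return K * P / (K + 1.0);
}

// Upper bound on the GPGPU speedup: d K T S / (2K + (3K + 1)d + 1).
double gpgpuSpeedupBound(double d, double K, double T, double S) {
    assert(d > 0 && K > 0 && T > 0 && S > 0);
    return d * K * T * S / (2.0 * K + (3.0 * K + 1.0) * d + 1.0);
}
```

With P = 8, S = 2, T = 512 and, as an assumed example, K = 128 and d = 2, the two functions evaluate to roughly 7.9 and 255, in line with the figures quoted above.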

Implementations

I’m going to skip the sequential implementation since it’s not interesting. Instead, I’m going to cover the C++ parallel and C# GPGPU implementations in detail, then briefly mention how scikit-learn was configured for testing.

C++

The parallel Windows API implementation is straightforward. The following will begin with the basic building blocks, then get into the high level orchestration code. Let’s begin with the barrier implementation. Since I’m running on Windows 7, I’m unable to use the convenient InitializeSynchronizationBarrier, EnterSynchronizationBarrier, and DeleteSynchronizationBarrier API calls, which are only available beginning with Windows 8. Instead I opted to implement a barrier using a condition variable and critical section as follows:

// ----------------------------------------------------------------------------
// Synchronization utility functions
// ----------------------------------------------------------------------------

struct Barrier {
	CONDITION_VARIABLE conditionVariable;
	CRITICAL_SECTION criticalSection;
	int atBarrier;
	int expectedAtBarrier;
	// Incremented each time the barrier releases; lets sleeping threads tell a
	// real release apart from a spurious wake-up.
	unsigned int generation;
};

void deleteBarrier(Barrier* barrier) {
	DeleteCriticalSection(&(barrier->criticalSection));
	// No API for delete condition variable
}

void initializeBarrier(Barrier* barrier, int numThreads) {
	barrier->atBarrier = 0;
	barrier->expectedAtBarrier = numThreads;
	barrier->generation = 0;

	InitializeConditionVariable(&(barrier->conditionVariable));
	InitializeCriticalSection(&(barrier->criticalSection));
}

bool synchronizeBarrier(Barrier* barrier, void (*func)(void*), void* data) {
	bool lastToEnter = false;

	EnterCriticalSection(&(barrier->criticalSection));

	unsigned int generation = barrier->generation;
	++(barrier->atBarrier);

	if (barrier->atBarrier == barrier->expectedAtBarrier) {
		barrier->atBarrier = 0;
		++(barrier->generation);
		lastToEnter = true;

		func(data);

		WakeAllConditionVariable(&(barrier->conditionVariable));
	}
	else {
		// Condition variables may wake spuriously, so keep sleeping until the
		// last thread has advanced the generation.
		while (barrier->generation == generation)
			SleepConditionVariableCS(&(barrier->conditionVariable), &(barrier->criticalSection), INFINITE);
	}

	LeaveCriticalSection(&(barrier->criticalSection));

	return lastToEnter;
}

A Barrier struct contains the necessary details of how many threads have arrived at the barrier, how many are expected, and structs for the condition variable and critical section.

When a thread arrives at the barrier (synchronizeBarrier) it requests the critical section before attempting to increment the atBarrier variable. It checks to see if it is the last to arrive, and if so, resets the number of threads at the barrier to zero and invokes the callback to perform post barrier actions exclusively before notifying the other threads through the condition variable that they can resume. If the thread is not the last to arrive, then it goes to sleep until the condition variable is signaled. The reason why LeaveCriticalSection is included outside the if statement is because SleepConditionVariableCS will release the critical section before putting the thread to sleep, then reacquire the critical section when it awakes. I don’t like that behavior since it’s an unnecessary acquisition of the critical section and slows down the implementation.
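
For readers not on Windows, the same idea can be sketched with the C++11 standard library in place of the Windows API. This is a swapped-in illustration rather than the code used for the benchmarks. The class and member names are hypothetical, and the generation counter is one common way to guard against spurious wake-ups.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>

class Barrier {
public:
    explicit Barrier(int expected) : expected_(expected) { assert(expected > 0); }

    // Blocks until `expected` threads have called this function. The last
    // thread to arrive runs `onLast` while holding the lock, then releases
    // the others. Returns true only for that last thread.
    bool arriveAndWait(const std::function<void()>& onLast) {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long myGeneration = generation_;
        if (++arrived_ == expected_) {
            arrived_ = 0;
            ++generation_;  // marks this round as finished
            onLast();
            released_.notify_all();
            return true;
        }
        // Waiting on the generation guards against spurious wake-ups.
        released_.wait(lock, [&] { return generation_ != myGeneration; });
        return false;
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    int expected_;
    int arrived_ = 0;
    unsigned long generation_ = 0;
};
```

Each worker would call arriveAndWait at the end of every iteration, passing the aggregation step as onLast. As with SleepConditionVariableCS, waiting threads reacquire the lock on the way out.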

There is a single allocation routine which performs a couple different rounds of error checking when calling calloc; first to check if the routine returned null, and second to see if it set a Windows error code that I could inspect from GetLastError. If either event is true, the application will terminate.

// ----------------------------------------------------------------------------
// Allocation utility functions
// ----------------------------------------------------------------------------

void* checkedCalloc(size_t count, size_t size) {
	SetLastError(NO_ERROR);

	void* result = calloc(count, size);
	DWORD lastError = GetLastError();

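	// size_t and DWORD are not int-sized on every target, so cast explicitly
	// and report the full request (count * size) rather than the element size.
	if (result == NULL) {
		fprintf(stdout, "Failed to allocate %llu bytes. GetLastError() = %lu.", (unsigned long long)count * size, (unsigned long)lastError);
		ExitProcess(EXIT_FAILURE);
	}

	if (result != NULL && lastError != NO_ERROR) {
		fprintf(stdout, "Allocated %llu bytes. GetLastError() = %lu.", (unsigned long long)count * size, (unsigned long)lastError);
		ExitProcess(EXIT_FAILURE);
	}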

	return result;
}

Now on to the core of the implementation. A series of structs are specified for those data that are shared (e.g., points, centroids, etc) among the threads, and those that are local to each thread (e.g., point boundaries, partial results).

// ----------------------------------------------------------------------------
// Parallel Implementation
// ----------------------------------------------------------------------------

struct LocalAssignData;

struct SharedAssignData {
	Barrier barrier;
	bool continueLoop;

	int numPoints;
	int pointDim;
	int K;

	double* points;
	double* centroids;
	int* labels;

	int maxIter;
	double change;
	double pChange;

	DWORD numProcessors;
	DWORD numThreads;

	LocalAssignData* local;
};

struct LocalAssignData {
	SharedAssignData* shared;
	int begin;
	int end;

	int* labelCount;
	double* partialCentroids;
};

The assign method does exactly what was specified in the parallel algorithm section. It will iterate over the portion of points it is responsible for and compute their labels and its partial centroids (the sum of points with label k; the division is done at the aggregate step).

void assign(int* label, int begin, int end, int* labelCount, int K, double* points, int pointDim, double* centroids, double* partialCentroids) {
	int* local = (int*)checkedCalloc(end - begin, sizeof(int));

	int* localCount = (int*)checkedCalloc(K, sizeof(int));
	double* localPartial = (double*)checkedCalloc(pointDim * K, sizeof(double));

	// Process a chunk of the array.
	for (int point = begin; point < end; ++point) {
		double optDist = INFINITY;
		int optCentroid = -1;

		for (int centroid = 0; centroid < K; ++centroid) {
			double dist = 0.0;
			for (int dim = 0; dim < pointDim; ++dim) {
				double d = points[point * pointDim + dim] - centroids[centroid * pointDim + dim];
				dist += d * d;
			}

			if (dist < optDist) {
				optDist = dist;
				optCentroid = centroid;
			}
		}

		local[point - begin] = optCentroid;
		++localCount[optCentroid];

		for (int dim = 0; dim < pointDim; ++dim)
			localPartial[optCentroid * pointDim + dim] += points[point * pointDim + dim];
	}

	memcpy(&label[begin], local, sizeof(int) * (end - begin));
	free(local);

	memcpy(labelCount, localCount, sizeof(int) * K);
	free(localCount);

	memcpy(partialCentroids, localPartial, sizeof(double) * pointDim * K);
	free(localPartial);
}

One thing that I experimented with that gave me better performance was allocating and using memory within the function instead of allocating the memory outside and using it within the assign routine. This was motivated by reading about false sharing, where two separate threads writing to the same cache line cause coherence updates to cascade in the CPU, degrading overall performance. The labelCount and partialCentroids arrays are allocated locally as well since I was concerned about data locality and wanted the three arrays to be in roughly the same neighborhood of memory. Speaking of which, the points array is laid out contiguously so that the dimensions of a point are adjacent in memory to take advantage of caching. Overall, a series of cache-friendly optimizations.
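
A different and commonly used way to sidestep false sharing, shown here only as a sketch and not what the assign routine does, is to pad each thread's accumulator out to a full cache line. The 64-byte line size and the names kCacheLineSize, PaddedCount, and totalCount are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kCacheLineSize = 64;

// Per-thread counter padded out to a full cache line so that neighbouring
// array elements never share a line.
struct alignas(kCacheLineSize) PaddedCount {
    int value = 0;
};

static_assert(sizeof(PaddedCount) == kCacheLineSize, "one counter per cache line");
static_assert(alignof(PaddedCount) == kCacheLineSize, "counters start on a line boundary");

// Sums the per-thread counters once the threads have finished.
int totalCount(const PaddedCount* counts, std::size_t numThreads) {
    assert(counts != nullptr);
    int total = 0;
    for (std::size_t t = 0; t < numThreads; ++t) total += counts[t].value;
    return total;
}
```

The trade-off is memory: every counter now occupies a whole line, which is why function-local scratch buffers are attractive when there are K * d accumulators per thread.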

The aggregate routine follows a similar set of enhancements. The core of the method is to compute the new centroid locations based on the partial sums and centroid assignment counts given by args->shared->local[t].partialCentroids and args->shared->local[t].labelCount. Using these partial results allows the routine to complete in \mathcal{O}(P K d) time which, assuming all of these parameters are significantly less than N, gives an effectively constant time routine. Once the centroids have been updated, the change in their location is computed and used to determine convergence along with how many iterations have gone by. Here if more than 1,000 iterations have occurred or the relative change in position is less than some tolerance (0.1%) then the threads will terminate.

void aggregate(void * data) {
	LocalAssignData* args = (LocalAssignData*)data;

	int* assignmentCounts = (int*)checkedCalloc(args->shared->K, sizeof(int));
	double* newCentroids = (double*)checkedCalloc(args->shared->K * args->shared->pointDim, sizeof(double));

	// Compute the assignment counts from the work the threads did. There is
	// one LocalAssignData per processor (the worker threads plus the calling
	// thread), so iterate over numProcessors rather than numThreads.
	for (DWORD t = 0; t < args->shared->numProcessors; ++t)
		for (int k = 0; k < args->shared->K; ++k)
			assignmentCounts[k] += args->shared->local[t].labelCount[k];

	// Compute the location of the new centroids based on the work that the
	// threads did.
	for (DWORD t = 0; t < args->shared->numProcessors; ++t)
		for (int k = 0; k < args->shared->K; ++k)
			for (int dim = 0; dim < args->shared->pointDim; ++dim)
				newCentroids[k * args->shared->pointDim + dim] += args->shared->local[t].partialCentroids[k * args->shared->pointDim + dim];

	for (int k = 0; k < args->shared->K; ++k)
		for (int dim = 0; dim < args->shared->pointDim; ++dim) {
			// A cluster with no points keeps its previous centroid to avoid
			// dividing by zero.
			if (assignmentCounts[k] > 0)
				newCentroids[k * args->shared->pointDim + dim] /= assignmentCounts[k];
			else
				newCentroids[k * args->shared->pointDim + dim] = args->shared->centroids[k * args->shared->pointDim + dim];
		}

	// See by how much the position of the centroids changed.
	args->shared->change = 0.0;
	for (int k = 0; k < args->shared->K; ++k)
		for (int dim = 0; dim < args->shared->pointDim; ++dim) {
			double d = args->shared->centroids[k * args->shared->pointDim + dim] - newCentroids[k * args->shared->pointDim + dim];
			args->shared->change += d * d;
		}

	// Store the new centroid locations into the centroid output.
	memcpy(args->shared->centroids, newCentroids, sizeof(double) * args->shared->pointDim * args->shared->K);

	// Decide if the loop should continue or terminate. (max iterations 
	// exceeded, or relative change not exceeded.)
	args->shared->continueLoop = args->shared->change > 0.001 * args->shared->pChange && --(args->shared->maxIter) > 0;

	args->shared->pChange = args->shared->change;

	free(assignmentCounts);
	free(newCentroids);
}

Each individual thread follows the same specification as given in the parallel algorithm section, and follows the calling convention required by the Windows API.

DWORD WINAPI assignThread(LPVOID data) {
	LocalAssignData* args = (LocalAssignData*)data;

	while (args->shared->continueLoop) {
		memset(args->labelCount, 0, sizeof(int) * args->shared->K);

		// Assign points cluster labels
		assign(args->shared->labels, args->begin, args->end, args->labelCount, args->shared->K, args->shared->points, args->shared->pointDim, args->shared->centroids, args->partialCentroids);

		// Tell the last thread to enter here to aggregate the data within a 
		// critical section
		synchronizeBarrier(&(args->shared->barrier), aggregate, args);
	}

	return 0;
}

The parallel algorithm controller itself is fairly simple and is responsible for basic preparation, bookkeeping, and cleanup. The number of processors is used to determine the number of threads to launch. The calling thread will run one instance while the remaining P - 1 instances will run on separate threads. The data is partitioned, then the threads are spawned using the CreateThread routine. I wish there was a Windows API that would allow me to simultaneously create P threads with a specified array of arguments because CreateThread will automatically start the thread as soon as it’s created. If lots of threads are being created, then the first will wait a long time before the last one gets around to reaching the barrier. Subsequent iterations of the synchronized loops will have better performance, but it would be nice to avoid that initial delay. After kicking off the threads, the main thread will run its own block of data, and once all threads terminate, the routine will close open handles and free allocated memory.

void kMeansFitParallel(double* points, int numPoints, int pointDim, int K, double* centroids) {
	// Lookup and calculate all the threading related values.
	SYSTEM_INFO systemInfo;
	GetSystemInfo(&systemInfo);

	DWORD numProcessors = systemInfo.dwNumberOfProcessors;
	DWORD numThreads = numProcessors - 1;
	DWORD pointsPerProcessor = numPoints / numProcessors;

	// Prepare the shared arguments that will get passed to each thread.
	SharedAssignData shared;
	shared.numPoints = numPoints;
	shared.pointDim = pointDim;
	shared.K = K;
	shared.points = points;
	
	shared.continueLoop = true;
	shared.maxIter = 1000;
	shared.pChange = 0.0;
	shared.change = 0.0;
	shared.numThreads = numThreads;
	shared.numProcessors = numProcessors;

	initializeBarrier(&(shared.barrier), numProcessors);

	shared.centroids = centroids;
	for (int i = 0; i < K; ++i) {
		int point = rand() % numPoints;
		for (int dim = 0; dim < pointDim; ++dim)
			shared.centroids[i * pointDim + dim] = points[point * pointDim + dim];
	}

	shared.labels = (int*)checkedCalloc(numPoints, sizeof(int));

	// Create thread workload descriptors
	LocalAssignData* local = (LocalAssignData*)checkedCalloc(numProcessors, sizeof(LocalAssignData));
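	for (int i = 0; i < numProcessors; ++i) {
		local[i].shared = &shared;
		local[i].begin = i * pointsPerProcessor;
		// The last partition absorbs the remainder when numPoints is not a
		// multiple of numProcessors; otherwise the trailing points would
		// never be assigned.
		local[i].end = (i + 1 == numProcessors) ? numPoints : (i + 1) * pointsPerProcessor;
		local[i].labelCount = (int*)checkedCalloc(K, sizeof(int));
		local[i].partialCentroids = (double*)checkedCalloc(K * pointDim, sizeof(double));
	}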

	shared.local = local;

	// Kick off the threads
	HANDLE* threads = (HANDLE*)checkedCalloc(numThreads, sizeof(HANDLE));
	for (int i = 0; i < numThreads; ++i)
		threads[i] = CreateThread(0, 0, assignThread, &local[i + 1], 0, NULL);

	// Do work on this thread so that it's just not sitting here idle while the 
	// other threads are doing work.
	assignThread(&local[0]);

	// Clean up
	WaitForMultipleObjects(numThreads, threads, true, INFINITE);
	for (int i = 0; i < numThreads; ++i)
		CloseHandle(threads[i]);

	free(threads);

	for (int i = 0; i < numProcessors; ++i) {
		free(local[i].labelCount);
		free(local[i].partialCentroids);
	}

	free(local);

	free(shared.labels);

	deleteBarrier(&(shared.barrier));
}
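
Whenever numPoints is not an exact multiple of the thread count, the remainder has to land somewhere. A minimal sketch of one way to spread it evenly across the partitions follows; partitionRanges is a hypothetical helper and not part of the implementation above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, numPoints) into `parts` half-open ranges whose sizes differ by
// at most one, so that no trailing points are left unassigned.
std::vector<std::pair<std::size_t, std::size_t>> partitionRanges(std::size_t numPoints, std::size_t parts) {
    assert(parts > 0);
    const std::size_t base = numPoints / parts;
    const std::size_t extra = numPoints % parts;  // the first `extra` ranges get one more point

    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(parts);
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t end = begin + base + (p < extra ? 1 : 0);
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

Splitting 10 points four ways this way yields ranges of sizes 3, 3, 2, and 2, which keeps the per-thread work as balanced as the barrier-synchronized loop would like.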

C#

The CUDAfy.NET GPGPU C# implementation required a lot of experimentation to find an efficient solution.

In the GPGPU paradigm there is a host and a device: sequential operations take place on the host (i.e., managed C# code) and parallel operations on the device (i.e., CUDA code). To delineate between the two, the [Cudafy] method attribute is used on the static public method assign. The set of host operations are all within the Fit routine.

Under the CUDA model, threads are bundled together into blocks, and blocks together into a grid. Here the data is partitioned so that each block consists of half the maximum number of threads possible per block and the total number of blocks is the number of points divided by that quantity. This was done through experimentation, and motivated by Thomas Bradley’s Advanced CUDA Optimization workshop notes, which suggest that in that regime the memory lines become saturated and cannot yield better throughput. Each block runs on a Streaming Multiprocessor (a collection of CUDA cores) having shared memory that the threads within the block can use. These blocks are then executed in pipeline fashion on the available Streaming Multiprocessors to give the desired performance from the GPGPU.

What is nice about the shared memory is that it is much faster than the global memory of the GPGPU. (cf. Using Shared Memory in CUDA C/C++) To make use of this fact the threads will rely on two arrays in shared memory: sum of the points and the count of those belonging to each centroid. Once the arrays have been zeroed out by the threads, all of the threads will proceed to find the nearest centroid of the single point they are assigned to and then update those shared arrays using the appropriate atomic operations. Once all of the threads complete that assignment, the very first thread will then add the arrays in shared memory to those in the global memory using the appropriate atomic operations.

using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;
using Cudafy.Atomics;
using System;

namespace CUDAfyTesting {
    public class CUDAfyKMeans {
        [Cudafy]
        public static void assign(GThread thread, int[] constValues, double[] centroids, double[] points, float[] outputSums, int[] outputCounts) {
            // Unpack the const value array
            int pointDim = constValues[0];
            int K = constValues[1];
            int numPoints = constValues[2];

            // Work out which point this thread is responsible for. Threads 
            // that fall past the end of the points array must not return 
            // early: they still help zero the shared arrays and have to 
            // reach both SyncThreads calls.
            int tId = thread.threadIdx.x;
            int point = thread.blockIdx.x * thread.blockDim.x + tId;

            // Use two shared arrays since they are much faster than global 
            // memory. The shared arrays will be scoped to the block that this 
            // thread belongs to.

            // Accumulate each point's dimensions assigned to the k'th 
            // centroid. When K = 128 => pointDim = 2; when pointDim = 128 
            // => K = 2; Thus max(len(sharedSums)) = 256.
            float[] sharedSums = thread.AllocateShared<float>("sums", 256);
            if (tId < K * pointDim)
                sharedSums[tId] = 0.0f;

            // Keep track of how many times the k'th centroid has been assigned 
            // to a point. max(K) = 128
            int[] sharedCounts = thread.AllocateShared<int>("counts", 128);
            if (tId < K)
                sharedCounts[tId] = 0;

            // Make sure all threads share the same shared state before doing 
            // any calculations.
            thread.SyncThreads();

            // Only threads that map to a real point take part in labeling.
            if (point < numPoints) {
                // Find the optCentroid for point.
                double optDist = double.PositiveInfinity;
                int optCentroid = -1;

                for (int centroid = 0; centroid < K; ++centroid) {
                    double dist = 0.0;
                    for (int dim = 0; dim < pointDim; ++dim) {
                        double d = centroids[centroid * pointDim + dim] - points[point * pointDim + dim];
                        dist += d * d;
                    }

                    if (dist < optDist) {
                        optDist = dist;
                        optCentroid = centroid;
                    }
                }

                // Add the point to the optCentroid sum. CUDA (prior to 
                // compute capability 6.0) doesn't support double precision 
                // atomicAdd so cast down to float...
                for (int dim = 0; dim < pointDim; ++dim)
                    thread.atomicAdd(ref(sharedSums[optCentroid * pointDim + dim]), (float)points[point * pointDim + dim]);

                // Increment the optCentroid count
                thread.atomicAdd(ref(sharedCounts[optCentroid]), +1);
            }


            // Wait for all of the threads to complete populating the shared 
            // memory before storing the results back to global memory where 
            // the host can access the results.
            thread.SyncThreads();

            // Have to do a lock on both of these since some other Streaming 
            // Multiprocessor could be running and attempting to update the 
            // values at the same time.

            // Copy the shared sums to the output sums
            if (tId == 0)
                for (int i = 0; i < K * pointDim; ++i)
                    thread.atomicAdd(ref(outputSums[i]), sharedSums[i]);

            // Copy the shared counts to the output counts
            if (tId == 0)
                for (int i = 0; i < K; i++)
                    thread.atomicAdd(ref(outputCounts[i]), sharedCounts[i]);
        }

Before going on to the Fit method, let’s look at what CUDAfy.NET is doing under the hood to convert the C# code to run on the CUDA-enabled GPGPU. Within the CUDAfy.Translator namespace there are a handful of classes for decompiling the application into an abstract syntax tree (AST) using ICSharpCode.Decompiler and Mono.Cecil, converting the AST over to CUDA C via the visitor pattern, and compiling the resulting CUDA C using NVIDIA’s NVCC compiler. If there’s a problem, the compilation result is relayed back to the caller; otherwise, a CudafyModule instance is returned, and the compiled CUDA C code it represents is loaded onto the GPGPU. (The classes and method calls of interest are: CudafyTranslator.DoCudafy, CudaLanguage.RunTransformsAndGenerateCode, CUDAAstBuilder.GenerateCode, CUDAOutputVisitor and CudafyModule.Compile.)

        private CudafyModule cudafyModule;
        private GPGPU gpgpu;
        private GPGPUProperties properties;

        public int PointDim { get; private set; }
        public double[] Centroids { get; private set; }

        public CUDAfyKMeans() {
            cudafyModule = CudafyTranslator.Cudafy();

            gpgpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
            properties = gpgpu.GetDeviceProperties(true);

            gpgpu.LoadModule(cudafyModule);
        }

The Fit method follows the same paradigm that I presented earlier with the C++ code. The main difference here is the copying of managed .NET resources (arrays) over to the device. I found these operations to be relatively time intensive, and I did find some suggestions from the CUDAfy.NET website on how to use pinned memory: essentially copy the managed memory to unmanaged memory, then do an asynchronous transfer from the host to the device. I tried this with the points array since it’s the largest resource, but did not see noticeable gains so I left it as is.

At the beginning of each iteration of the main loop, the device counts and sums are cleared out through the Set method, then the CUDA code is invoked using the Launch routine with the specified block and grid dimensions and device pointers. One thing that the API does is return an array when you allocate or copy memory over to the device. Personally, an IntPtr seems more appropriate. Execution of the routine is very quick, where on some of my tests it took 1 to 4 ms to process 100,000 two dimensional points. Once the routine returns, memory from the device (sum and counts) is copied back over to the host which then does a quick operation to derive the new centroid locations and copy that memory over to the device for the next iteration.

        public void Fit(double[] points, int pointDim, int K) {
            if (K <= 0)
                throw new ArgumentOutOfRangeException("K", "Must be greater than zero.");

            if (pointDim <= 0)
                throw new ArgumentOutOfRangeException("pointDim", "Must be greater than zero.");

            if (points.Length < pointDim)
                throw new ArgumentOutOfRangeException("points", "Must have at least pointDim entries.");

            if (points.Length % pointDim != 0)
                throw new ArgumentException("points.Length must be n * pointDim > 0.");

            int numPoints = points.Length / pointDim;

            // Figure out the partitioning of the data.
            int threadsPerBlock = properties.MaxThreadsPerBlock / 2;
            int numBlocks = (numPoints / threadsPerBlock) + (numPoints % threadsPerBlock > 0 ? 1 : 0);

            dim3 blockSize = new dim3(threadsPerBlock, 1, 1);

            dim3 gridSize = new dim3(
                Math.Min(properties.MaxGridSize.x, numBlocks),
                Math.Min(properties.MaxGridSize.y, (numBlocks / properties.MaxGridSize.x) + (numBlocks % properties.MaxGridSize.x > 0 ? 1 : 0)),
                1
                );

            int[] constValues = new int[] { pointDim, K, numPoints };
            float[] assignmentSums = new float[pointDim * K];
            int[] assignmentCount = new int[K];

            // Initial centroid locations picked at random
            Random prng = new Random();
            double[] centroids = new double[K * pointDim];
            for (int centroid = 0; centroid < K; centroid++) {
                int point = prng.Next(points.Length / pointDim);
                for (int dim = 0; dim < pointDim; dim++)
                    centroids[centroid * pointDim + dim] = points[point * pointDim + dim];
            }

            // These arrays are only read from on the GPU; they are never 
            // written on the GPU.
            int[] deviceConstValues = gpgpu.CopyToDevice<int>(constValues);
            double[] deviceCentroids = gpgpu.CopyToDevice<double>(centroids);
            double[] devicePoints = gpgpu.CopyToDevice<double>(points);

            // These arrays are written to on the GPU.
            float[] deviceSums = gpgpu.CopyToDevice<float>(assignmentSums);
            int[] deviceCount = gpgpu.CopyToDevice<int>(assignmentCount);


            // Set up main loop so that no more than maxIter iterations take 
            // place, and that a relative change less than 1% in centroid 
            // positions will terminate the loop.
            int maxIter = 1000;
            double change = 0.0, pChange = 0.0;

            do {
                pChange = change;

                // Clear out the assignments, and assignment counts on the GPU.
                gpgpu.Set(deviceSums);
                gpgpu.Set(deviceCount);

                // Launch the GPU portion
                gpgpu.Launch(gridSize, blockSize, "assign", deviceConstValues, deviceCentroids, devicePoints, deviceSums, deviceCount);

                // Copy the results memory from the GPU over to the CPU.
                gpgpu.CopyFromDevice<float>(deviceSums, assignmentSums);
                gpgpu.CopyFromDevice<int>(deviceCount, assignmentCount);

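                // Compute the new centroid locations. A centroid with no 
                // assigned points keeps its previous location to avoid 
                // dividing by zero.
                double[] newCentroids = new double[centroids.Length];
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim)
                        newCentroids[centroid * pointDim + dim] = assignmentCount[centroid] > 0
                            ? assignmentSums[centroid * pointDim + dim] / assignmentCount[centroid]
                            : centroids[centroid * pointDim + dim];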

                // Calculate how much the centroids have changed to decide 
                // whether or not to terminate the loop.
                change = 0.0;
                for (int centroid = 0; centroid < K; ++centroid)
                    for (int dim = 0; dim < pointDim; ++dim) {
                        double d = newCentroids[centroid * pointDim + dim] - centroids[centroid * pointDim + dim];
                        change += d * d;
                    }

                // Update centroid locations on CPU & GPU
                Array.Copy(newCentroids, centroids, newCentroids.Length);
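                // Reuse the existing device allocation rather than allocating
                // a new device array on every iteration.
                gpgpu.CopyToDevice<double>(centroids, deviceCentroids);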

            } while (change > 0.01 * pChange && --maxIter > 0);

            gpgpu.FreeAll();

            this.Centroids = centroids;
            this.PointDim = pointDim;
        }
    }
}

Python

I include the Python implementation for the sake of demonstrating how scikit-learn was invoked throughout the following experiments section.

from sklearn.cluster import KMeans

model = KMeans(
           n_clusters = numClusters,
           init='random',
           n_init = 1,
           max_iter = 1000,
           tol = 1e-3,
           precompute_distances = False,
           verbose = 0,
           copy_x = False,
           n_jobs = numThreads
           )

model.fit(X)    # X = (numPoints, pointDim) numpy array.

Experimental Setup

All experiments were conducted on a laptop with an Intel Core i7-2630QM Processor and NVIDIA GeForce GT 525M GPGPU running Windows 7 Home Premium. C++ and C# implementations were developed and compiled by Microsoft Visual Studio Express 2013 for Desktop targeting C# .NET Framework 4.5 (Release, Mixed Platforms) and C++ (Release, Win32). The Python implementation was developed and compiled using Eclipse Luna 4.4.1 targeting Python 2.7, scikit-learn 0.16.0, and numpy 1.9.1. All compilers use default arguments and no extra optimization flags.

For each test, each reported test point is the median of thirty sample run times of a given algorithm and set of arguments. Run time is computed as the time taken to execute model.fit(points, pointDim, numClusters) where time is measured by: QueryPerformanceCounter in C++, System.Diagnostics.Stopwatch in C#, and time.clock in Python. Every test is based on a dataset having two natural clusters at .25 or -.25 in each dimension.
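As a rough illustration of the measurement procedure, here is a minimal Python sketch of taking the median of thirty timed runs. The helper name median_run_time and the fit callable are hypothetical, and it uses time.perf_counter as an assumption because time.clock was removed in Python 3.8; it is not the harness used for the reported numbers.

```python
import statistics
import time


def median_run_time(fit, runs=30):
    """Return the median wall-clock time, in seconds, of `runs` calls to fit()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fit()  # e.g. lambda: model.fit(X)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The median is less sensitive than the mean to the occasional slow run caused by other activity on the machine.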

Results

Varying point quantity

point-quantity
Figure X: Left-to-right: C++, C#, Python run time to cluster 10 to 10^7 two dimensional points into two clusters.

Both the C++ and C# sequential and parallel implementations outperform the Python scikit-learn implementations. However, the C# parallel implementation outperforms the C++ one, as it seems the overhead associated with multithreading overrides any multithreaded performance gains one would expect. The C# CUDAfy.NET implementation surprisingly does not outperform the C# parallel implementation, but does outperform the C# sequential one as the number of points to cluster increases.

So what’s the deal with Python scikit-learn? Why is the parallel version so slow? Well, it turns out I misunderstood the n_jobs parameter. I interpreted it to mean that the process of clustering a single set of points would be done in parallel; however, it actually sets how many independent runs of the whole process occur in parallel. I was tipped off to this when I noticed multiple python.exe processes being forked. I was surprised that someone would implement a parallel routine that way, which led to a more thorough reading of the scikit-learn documentation. There is parallelism going on with scikit-learn, just not the desired type. Taking that into account, the sequential version performs reasonably well for a dynamically typed interpreted language.

Varying point dimension

point-dimension
Figure X: Left-to-right: C++, C#, Python run time to cluster 10^5 points of 2 to 2^7 dimensions into two clusters.

The C++ and C# parallel implementations exhibit consistently improved run time over their sequential counterparts. In all cases the performance is better than scikit-learn’s. Surprisingly, the C# CUDAfy.NET implementation does worse than both the C# sequential and parallel implementations. Why do we not see better CUDAfy.NET performance? The performance we see is identical to the varying point quantity test. So on one hand it’s nice that increasing the point dimensions did not dramatically increase the run time, but ideally, the CUDAfy.NET performance should be better than the sequential and parallel C# variants for this test. My leading theory is that higher point dimensions result in more data that must be transferred between host and device, which is a relatively slow process. Since I’m short on time, this will have to be something I investigate in more detail in the future.

Varying cluster quantity

cluster-quantity
Figure X: Left-to-right: C++, C#, Python run time to cluster 10^5 two dimensional points into 2 to 2^7 clusters.

As in the point dimension test, the C++ and C# parallel implementations outperform their sequential counterparts, while the scikit-learn implementation starts to show some competitive performance. The exciting news of course is that varying the cluster quantity finally reveals improved C# CUDAfy.NET run time. Now there is some curious behavior at the beginning of each plot. We get \le 10 \text{ ms} performance for two clusters, then jump up into about \le 100 \text{ ms} for four to eight clusters. Number of points and their dimension are held constant, but we allocate a few extra doubles for the cluster centroids. I believe this has to do with cache behavior. I’m assuming for fewer than four clusters everything that’s needed sits nicely in the fast L1 cache, and moving up to four and more clusters requires more exchanging of data between L1, L2, L3, and (slower) main memory across the different cores of the Intel Core i7-2630QM processor I’m using. As before, I’ll need to do some more tests to verify that this is what is truly happening.

Language comparison

lang-compare
Figure X: Left-to-right: point quantity, point dimension, and cluster quantity run time summaries for C++, C#, and Python implementations. Columns in yellow are the fastest observed implementation and paradigm for the given test.

For the three tests considered, the C# parallel implementation gave the best run time performance on point quantity and point dimension tests while the C# CUDAfy.NET implementation gave the best performance on the cluster quantity test.

You may be wondering why the C++ implementation doesn’t perform as well as the C# ones. After all, C++ is native and everyone knows that if it has to be fast, it has to be C++, right? Well, it boils down to memory allocation and what is, and isn’t, included in the run time measurements. In C#, when an application is first created, a block of memory is allocated for the managed heap. As a result, allocation of reference types in C# is done by incrementing a pointer instead of doing an unmanaged allocation (malloc, etc.). (cf. Automatic Memory Management) This allocation takes place before executing the C# routines, while the same allocation takes place during the C++ routines. Hence, the C++ run times are reported as being slower. Had I implemented memory allocation in C++ the same as it’s done in C#, then the C++ implementation would undoubtedly be faster than the C# ones.

While using scikit-learn in Python is convenient for exploratory data analysis and prototyping machine learning algorithms, it leaves much to be desired in performance, frequently coming in ten times slower than the other two implementations on the varying point quantity and dimension tests, but within tolerance on the varying cluster quantity tests.

Future Work

The algorithmic approach here was to parallelize work on data points, but as the dimension of each point increases, it may make sense to explore algorithms that parallelize work across dimensions instead of points.

I’d like to spend more time figuring out some of the high-performance nuances of programming the GPGPU (as well as traditional C++), which take more time and patience than the week or two I spent on this. In addition, I’d like to dig a little deeper into doing CUDA C directly rather than through the convenient CUDAfy.NET wrapper, as well as explore OpenMP and OpenCL to see how they compare from a development and performance-oriented view to CUDA.

Python and scikit-learn were used as a baseline here, but it would be worth spending extra time to see how R and Julia compare, especially the latter since Julia pitches itself as a high-performance solution, and is used for exploratory data analysis and prototyping machine learning systems.

While the emphasis here was on trying out CUDAfy.NET and getting some exposure to GPGPU programming, I’d like to apply CUDAfy.NET to the expectation maximization algorithm for fitting multivariate Gaussian mixture models to a dataset. GMMs are a natural extension of k-means clustering, and it will be good to implement the more involved EM algorithm.

Conclusions

Through this exercise, we can expect to see modest speedups over sequential implementations of about 2.62x and 11.69x in the C# parallel and GPGPU implementations respectively when attempting to find large numbers of clusters on low dimensional data. Fortunately, the way you use k-means clustering is to find the cluster quantity that maximizes the Bayesian information criterion or Akaike information criterion, which means running the varying centroid quantity test on real data. On the other hand, most machine learning data is of a high dimension, so further testing (on a real data set) would be needed to verify its effectiveness in a production environment. Nonetheless, we’ve seen how parallel and GPGPU based approaches can reduce the time it takes to complete the clustering task, and learned some things along the way that can be applied to future work.

Bibliography

[LiFa89] Li Xiaobo and Fang Zhixi, “Parallel clustering algorithms”, Parallel Computing, 1989, 11(3): pp.275-290.

[MaMi09] Mario Zechner, Michael Granitzer. “Accelerating K-Means on the Graphics Processor via CUDA.” First International Conference on Intensive Applications and Services, INTENSIVE’09. pp. 7-15, 2009.

[Stu82] Stuart P. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28:129-137, 1982.

Written by lewellen

2015-09-01 at 8:00 am

Notes from SIGGRAPH 2015


Introduction

I recently flew out to Los Angeles to attend the 42nd International Conference and Exhibition on Computer Graphics and Interactive Techniques. SIGGRAPH’s theme this year was the crossroads of discovery, bringing it closer to its roots that began here in Boulder, Colorado back in 1974. For me it was a chance to dig a little deeper into Computer Graphics research following my recent studies and develop a better understanding of the industries pushing the domain forward. As with most posts on this site, this is a short reminder to myself, and hopefully gives others an idea of what they could expect if they went.

Production Sessions

Disney – Pixar’s “Lava”: Moving Mountains was an informative production session detailing the process of bringing “Lava” to the screen. “Lava” is the story of Uku, a lonely volcano in search of love. As millions of years go by, he begins to lose hope as he recedes back into the ocean. But all is not lost. Uku finds renewed hope for love as newly formed volcano Lele rises to the surface. After the Pixar magicians reveal their secrets, technical details, and engrossing backstory, “Lava” becomes an even more enjoyable short film.

The presentation began with director James Murphy explaining his personal story inspiring the short before giving a live performance of the titular song. Colin Levy followed Murphy’s conceptualization, storyboarding, and clay mockups with how the film would be framed for maximal emotional impact. Levy explained the exploratory process of filming the opening scene of the film to find the right combination of lenses and flight paths based on real-world references to help illustrate the size and scale of Uku, the hopeless volcano.

Both Aaron Hartline and Austin Lee continued by discussing the challenges of animating and rigging Uku, Lele, a pair of dolphins, birds, whales, and turtles (the last four representing young love, newlyweds, established lives, and lifelong love). In particular, they covered the different approaches for animating and rigging the facial features of Uku (eyelids, lips, cheeks, and so on) and how the teams iterated to find a balance between what the audience might expect from an anthropomorphic mountain and what they wanted to achieve as storytellers.

Perhaps the most interesting moment in the presentation was Dirk Van Gelder’s sneak peek of the enhancements the team made to Presto (Pixar’s in-house animation tool) to provide animators final render quality real-time feedback of their changes through a clever combination of RenderMan-based final renders and OpenGL hardware texturing. Aside from the technical novelty, it’s a great example of time saving enhancements that make it easier for people to freely experiment and explore different approaches leading to better results.

The closing discussion by Byron Bashforth and Farhez Rayani on shading and lighting was informative and it was interesting to see how the procedural approaches were done to give Uku both a physically realistic and visually appealing biome consisting of different shaders, and static and procedural assets. Overall, a very interesting peek into the workflow of one of the most venerable studios in the industry.

Birds of a Feather

Having worked in the healthcare space for a fair bit of time, I was attracted to meetings on Volume Rendering and Medical Visualization and HealthTech: Modeling, Interaction, Hardware, and Analysis to see what people have been working on and to get a glimpse of where things are heading.

volumetric-visualization
Credit: (Left) X3D Example Archives: Basic, Medical: “Skeleton Complete Normals” as seen in H3DViewer. (Right) Virginia Tech Visionarium examples: “Blended body internals”.

Nicholas Polys of Virginia Tech and Michael Aratow (MD) (both chairs of the Web3D Consortium Medical Working Group) began the medical visualization discussion by going over common libraries such as VTK (The Visualization Toolkit) and Voreen (Volume Rendering Engine), before discussing general purpose analysis and visualization tools such as Paraview. Volume oriented applications such as Seg3D (volume segmentation tool) and OsiriX (DICOM viewer) were covered, and finally, tools for exploring biomolecular systems such as Chimera, VMD (Visual Molecular Dynamics) and PathSim (Epstein-Barr Virus exploration) were discussed, giving the audience a good lay of the land. A brief bit of time was given to surgical training tools based on 3D technologies and haptic feedback (e.g. H3D).

These were all interesting applications and seeing how they all work using different types of human-machine interfaces (standard workstations, within CAVE environments, or even in virtual reality headsets and gloves) was eye opening. The second main theme of the discussion was on standardization when it comes to interoperability and reproducibility. There was a heavy push for X3D along with interoperability with DICOM. Like a lot of massive standards, DICOM has some wiggle room in it that leads to inconsistent implementations from vendors. That makes portability of data between disparate systems complicated (not to mention DICOM incorporates non-graphical metadata such as complex HL7). Suffice to say X3D is biting off a big chunk of work, and I think it will take some time for them to make progress in healthcare since it’s a fragmented industry that is not in the least bit technologically progressive.

One area I felt was absent during the discussion was how 3D graphics could be used to benefit everyday patients. There is a wealth of fMRI and ECoG data that patients could benefit from seeing in an accessible way- for example showing a patient a healthy baseline, then accentuating parts of their own data and explaining how those anomalies affect their well-being. If a component can be developed to deliver that functionality, then it can be incorporated into a patient portal alongside all other charts and information that providers have accumulated for the patient.

The HealthTech discussion was presented by Ramesh Raskar, and his graduate students and postdocs from the MIT Media Lab. They presented a number of low-cost, low-power diagnostic devices for retinal imaging and electroretinography, high-speed tomography, cellphone-based microscopy, skin perfusion photography, and dental imaging, along with more socially oriented technologies for identifying safe streets to travel and automatically discerning mental health from portraits. There were plenty of interesting applications being developed by the group, but it was more of a show and tell by the group than a discussion of the types of challenges beyond the scope of the work by MIT Media Lab (as impressive as they are). (For example, the fine work 3Shape A/S has done with fast scanning of teeth for digital dentistry.)

One thing discussed that was of key interest was Meddit, a way for medical practitioners and researchers to define open problems to maturity, then present those challenges to computer scientists to work on and develop solutions. While the company name is uninspired, I think this is the right kind of collaboration platform for the “toolmaker” view of hardware engineers, computer scientists and software engineers as it identifies a real issue, presents an opportunity, and gives a pool of talented, bright people a way to make a difference. I am skeptical that it will take off (I think it would have more success as a niche community within an umbrella collaboration platform, i.e. the Stack Exchange model), but the idea is sound and something people should get excited about.

Real-Time Live!

The challenge of real-time graphics is very appealing to me and getting to see what different software studios are working on was a real treat. While there were several presentations and awards given during the two hour long event, three demos stood out to me: Balloon Burst given by Miles Macklin of NVIDIA, BabyX presented by Mark Sagar of University of Auckland, and award winner A Boy and His Kite demoed by Nick Penwarden of Epic Games.

balloon-burst
Credit: Figure 1 of Fast Grid-Free Surface Tracking. Chentanez, Mueller, Macklin, Kim. 2015.

Macklin’s demo was impressive in that it simulated more than 750,000 particles (250,000 by their solver Flex, and 512,000 for mist and droplets), and their paper Fast Grid-Free Surface Tracking gave some technical background into how they achieved their results. Fluid simulation is something I’d like to spend some time exploring. I obviously won’t be able to create something as technical as Macklin’s group did, but I would like to spend some time on Smoothed-Particle Hydrodynamics, and seeing NVIDIA’s work was a good motivation boost to explore the subject further on my own.

Perhaps the most unexpected entry in the series was Sagar’s BabyX. It was a fascinating assemblage of neural networks, real time graphics, natural language processing, computer vision, and image processing to create the ultimate “Sims” like character- a baby that could learn and invoke different emotional responses based on external stimuli. Real-time graphics were photorealistic, and seeing the modeling behind the system to emulate how the brain behaves in the presence of different dopamine levels (and how those levels correspond to things like Parkinson’s and schizophrenia) was impressive as well. Overall, a fantastic technical achievement and I look forward to following Sagar’s work as it continues to evolve.

My main interest in going to Real-Time Live! was to see Penwarden’s work on A Boy and His Kite. This impressive demo, spanning a hundred square miles inspired by the Isle of Skye, really puts to shame my prior work in creating procedural environments. Nonetheless, it goes to show how far the medium can be pushed and how small the divide between real-time and film is becoming. Computer Graphics World published (July-August 2015) a very thorough technical overview [p. 40-48] of how Penwarden’s team produced the short, in addition to the features added to Unreal Engine 4 to make the demo shine.

Wrap-up

There were many other things I explored that I won’t go into detail on, namely the VR Village, Emerging Technologies, Research Posters, Exhibition, and Job Fair. I’m still quite skeptical that virtual reality (and to the same extent augmented reality) technologies will come into the mainstream; I think they’ll continue to be the subject of researchers, gaming enthusiasts, and industry solutions for automotive and healthcare problems. One thing that was a bit of a disappointment was the Job Fair as there were barely any companies participating. Overall, a positive experience learning what other people are doing in the industry, and getting to see how research is being applied in a variety of different domains including automotive, entertainment, engineering, healthcare, and science.

Written by lewellen

2015-08-14 at 1:41 pm

Posted in Computer Graphics


Algorithms for Procedurally Generated Environments


Fig. 1: Demonstration of different graphics techniques and features of the procedurally generated environment.

Introduction

I recently completed a graduate course in Computer Graphics that required us to demonstrate a significant understanding of OpenGL and general graphics techniques. Given the short amount of time to work with, I chose to work on creating a procedurally generated environment consisting of land, water, trees, a cabin, smoke, and flying insects. The following write-up explains my approach and the established algorithms that were used to create the different visual effects showcased in the video above.

Terrain

terrain-2
Fig. 2: First few iterations of the Midpoint Displacement algorithm.

There are a variety of different techniques for creating terrain. More complex ones rely on visualizing a three dimensional scalar field, while simpler ones visualize a two dimensional surface defined by a fixed image, or dynamically using a series of specially crafted functions or random behavior. I chose to take a fractal-based approach given by [FFC82]’s Midpoint Displacement algorithm. A two dimensional grid of size (2^n + 1)^2 (for n > 1) is allocated with each entry representing the height of the terrain at that row and column position. Beginning with the corners, the corresponding four midpoints, as well as singular center, are found and their height calculated by drawing from a uniform random variable whose support is given by the respective corners. The newly assigned values now form four separate squares which can be assigned values recursively as shown above. The Midpoint Displacement algorithm produces noisy surfaces that are jagged in appearance. To smooth out the results, a 3 \times 3 Gaussian Filter is applied to the surface twice in order to produce a more natural, smooth looking surface.
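To make the terrain procedure concrete, here is a minimal Python/numpy sketch of the steps described above. It is an illustration under assumptions, not the project's actual code: the function names are hypothetical, the corner heights are seeded from a uniform distribution on [0, 1], the 3 \times 3 kernel weights (1, 2, 1 outer product, divided by 16) are one common choice, and edges are handled by repeating the border value.

```python
import numpy as np


def midpoint_displacement(n, rng):
    """Fill a (2^n + 1) x (2^n + 1) height grid by recursive midpoint displacement."""
    size = 2 ** n + 1
    h = np.zeros((size, size))
    # Assumption: corner heights are seeded uniformly in [0, 1].
    h[0, 0], h[0, -1], h[-1, 0], h[-1, -1] = rng.uniform(0.0, 1.0, 4)

    def between(*values):
        # Uniform draw whose support is spanned by the given corner heights.
        return rng.uniform(min(values), max(values))

    step = size - 1
    while step > 1:
        half = step // 2
        for r in range(0, size - 1, step):
            for c in range(0, size - 1, step):
                tl, tr = h[r, c], h[r, c + step]
                bl, br = h[r + step, c], h[r + step, c + step]
                h[r, c + half] = between(tl, tr)          # top edge midpoint
                h[r + step, c + half] = between(bl, br)   # bottom edge midpoint
                h[r + half, c] = between(tl, bl)          # left edge midpoint
                h[r + half, c + step] = between(tr, br)   # right edge midpoint
                h[r + half, c + half] = between(tl, tr, bl, br)  # center
        step = half
    return h


def gaussian_smooth(h, passes=2):
    """Apply a 3 x 3 Gaussian-like filter `passes` times, repeating edge values."""
    kernel = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]]) / 16.0
    rows, cols = h.shape
    for _ in range(passes):
        padded = np.pad(h, 1, mode="edge")
        out = np.zeros_like(h, dtype=float)
        for dr in range(3):
            for dc in range(3):
                out += kernel[dr, dc] * padded[dr:dr + rows, dc:dc + cols]
        h = out
    return h
```

A terrain would then be produced with something like gaussian_smooth(midpoint_displacement(7, np.random.default_rng())).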

Face normals can be found by taking the cross product of the forward differences in both the row and column directions, however this leads to a faceted looking surface. For a more natural appearance, vertex normals are calculated using central differences in the row and column direction. The cross product of these approximations then gives the approximate surface normal at each vertex to ensure proper lighting.
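The vertex normal calculation can be sketched in a few lines of Python/numpy. This is a hedged illustration: the function name is hypothetical, and I assume x runs along columns, y along rows, and z is height. numpy's gradient uses central differences in the interior and one-sided differences on the border. The cross product of the tangents (1, 0, dh/dx) and (0, 1, dh/dy) is (-dh/dx, -dh/dy, 1).

```python
import numpy as np


def vertex_normals(h, spacing=1.0):
    """Unit normals of the height field h, one per grid vertex, shape (rows, cols, 3)."""
    # np.gradient returns the derivative along rows first, then along columns.
    d_row, d_col = np.gradient(h, spacing)
    normals = np.dstack((-d_col, -d_row, np.ones_like(h)))
    return normals / np.linalg.norm(normals, axis=2, keepdims=True)
```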

terrain-texturing
Fig. 3: Example process of blending an elevation band texture with the base texture to create geographically correct looking terrain textures.

Texturing of the terrain is done by dynamically blending eight separate textures together based on terrain height as shown above. The process begins by loading in a base texture which is applied to all heights of the terrain. The next texture to apply is loaded, and an alpha mask is created and applied to the next texture based on random noise and a specific terrain height band. Blending of the masked texture and base texture is a function of the terrain’s height where the normalized height is passed through a logistic function to decide what portion of each texture should be used. The combined texture then serves as a new base texture and the process repeats until all textures have been blended together.
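A single blending step might look like the following Python/numpy sketch. The function names, the midpoint and steepness parameters, and the omission of the random-noise alpha mask are all simplifying assumptions made for illustration; only the idea of passing normalized height through a logistic function comes from the description above.

```python
import numpy as np


def logistic(x, midpoint, steepness):
    """Logistic curve rising from 0 to 1 around `midpoint`."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - midpoint)))


def blend_band(base, overlay, height, midpoint, steepness=20.0):
    """Blend overlay onto base according to normalized terrain height.

    base, overlay: (rows, cols, 3) float color arrays.
    height: (rows, cols) terrain height normalized to [0, 1].
    """
    weight = logistic(height, midpoint, steepness)[..., np.newaxis]
    return (1.0 - weight) * base + weight * overlay
```

Repeating blend_band with the result as the new base, once per elevation band, mirrors the iterative process described above.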

Water

In order to generate realistic looking water, a number of different OpenGL abilities were employed to accurately capture a half dozen different water effects consisting of reflections, waves, ripples, lighting, and Fresnel effects. Compared to other elements of the project, water required the largest graphics effort to get right.

reflection-passes
Fig. 4: The different rendering passes for creating realistic looking water with reflections and refraction.

To obtain reflections, a three-pass rendering process is used. In the first pass, the scene is clipped and rendered to a frame buffer (with color and depth attachments) from above the water revealing only what is below the surface for the refraction effects. The second pass clips and renders the scene to a frame buffer (with only a color attachment) from below the water surface revealing only what is above the water for the reflection effects. The third pass then combines these buffers on the water surface through vertex and fragment shaders to give the desired appearance. To map the frame buffer renderings to the water surface, the clipped coordinates calculated in the vertex shader are converted to normalized device coordinates through perspective division in the fragment shader which allows one to map the (u, v) coordinate of the texture as it should appear on the screen to coordinates of the water surface.
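The mapping from clip coordinates to texture coordinates is short enough to write out. The sketch below is Python rather than shader code, and the function name is hypothetical; it only shows the arithmetic of perspective division followed by remapping normalized device coordinates from [-1, 1] to [0, 1].

```python
def clip_to_screen_uv(clip):
    """Map a clip-space position (x, y, z, w) to (u, v) texture coordinates."""
    x, y, _, w = clip
    # Perspective division gives normalized device coordinates in [-1, 1].
    ndc_x, ndc_y = x / w, y / w
    # Remap to the [0, 1] range used to sample the frame buffer texture.
    return (0.5 * ndc_x + 0.5, 0.5 * ndc_y + 0.5)
```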

shader-effects
Fig. 5: Reflection, ripples normal map, water depth, refraction, specular normal map, and depth buffer textures used to achieve different visual effects.

To create the appearance of water ripples, a normal map is sampled and the resulting time varying displacement is used to sample the reflection texture for the fragment’s reflection color. Next, a similar sampling of the normal map is done at a coarser level to emulate the specular lighting that would appear on the subtle water waves created by the vertex shader. Refraction ripples and caustic lighting are achieved by sampling from the normal map just as the surface ripples and specular lighting were. To make the water appear cloudy, the depth buffer from the refraction rendering is used in conjunction with water depth so that terrain deeper under water is less visible as it would be in real life.

To combine the reflection and refraction components, the Fresnel Effect is used. This effect causes the surface of the water to vary in appearance based on viewing angle. When the viewer’s gaze is shallow to the water surface, the water is dominated by the reflection component, while when gazing downward, the water is more transparent giving way to the refraction component. The final combination effect is to adjust the transparency of the texture near the shore so that shallower water reveals more of the underlying terrain.
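One simple way to approximate this blend is to weight the refraction color by the cosine of the angle between the water normal and the direction to the camera. The Python/numpy sketch below assumes exactly that; the function name and the power parameter are hypothetical, and the actual shader may use a different Fresnel approximation.

```python
import numpy as np


def fresnel_mix(reflection, refraction, to_camera, normal, power=1.0):
    """Blend reflection and refraction colors based on viewing angle."""
    to_camera = np.asarray(to_camera, dtype=float)
    normal = np.asarray(normal, dtype=float)
    cos_angle = np.dot(to_camera, normal) / (
        np.linalg.norm(to_camera) * np.linalg.norm(normal))
    # 1 when looking straight down at the water, 0 at a grazing angle.
    see_through = max(cos_angle, 0.0) ** power
    reflection = np.asarray(reflection, dtype=float)
    refraction = np.asarray(refraction, dtype=float)
    return see_through * refraction + (1.0 - see_through) * reflection
```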

Flora

flora-examples
Fig. 6: Examples of plants generated by using the Stochastic Lindenmayer Systems framework.

The scene consists of single cottonwood trees, but the underlying algorithm is based on Stochastic Lindenmayer Systems [Lin68], which can produce a large variety of flora as shown above. The idea is that an n-ary tree is created with geometric annotations consisting of length and radius, and relative position, yaw, and pitch to its parent node. Internal nodes of the tree are rendered as branches or stems, while leaf nodes are rendered as groups of leaves, flowers, fruits and so on. Depending on the type of plant one wishes to generate, these parameters take on different values, and the construction of the n-ary tree varies. For a palm tree, a linked list is created, whereas a flower may have several linked lists sharing a common head, and a bush may be a factorial tree with a regular pitch.
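The annotated n-ary tree can be sketched as a small recursive generator. This Python sketch is a simplification rather than a full Lindenmayer grammar: the PlantNode class, the grow function, and all of its parameters (shrink factor, pitch range, child counts) are hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlantNode:
    length: float
    radius: float
    yaw: float    # degrees, relative to the parent node
    pitch: float  # degrees, relative to the parent node
    children: List["PlantNode"] = field(default_factory=list)


def grow(depth, min_children, max_children, rng,
         length=1.0, radius=0.1, shrink=0.7, max_pitch=35.0):
    """Build a random tree `depth` levels deep with a random child count per node."""
    node = PlantNode(length, radius,
                     yaw=rng.uniform(0.0, 360.0),
                     pitch=rng.uniform(0.0, max_pitch))
    if depth > 1:
        for _ in range(rng.randint(min_children, max_children)):
            node.children.append(
                grow(depth - 1, min_children, max_children, rng,
                     length * shrink, radius * shrink, shrink, max_pitch))
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)
```

With one child per node, grow produces the linked list used for a palm trunk; with two or more it produces a bushier plant.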

Smoke

smoke
Fig. 7: Examples of smoke plumes generated using random particles, Metaballs, and Marching Tetrahedra algorithms.

The primary algorithmic challenge of the project was to visualize smoke coming from the chimney of the cabin in the scene. To achieve this, a simple particle system was written in conjunction with [Bli82]’s Metaballs and [PT+90]’s Marching Tetrahedra algorithms. The tetrahedral variant was chosen since it is easier to implement from scratch than [LC87]’s original Marching Cubes algorithm. The resulting smoke plumes produced from this chain of algorithms are shown above.

particles
Fig. 8: Visual explanation of the smoke generation process in two dimensions.

Particles possess position and velocity three dimensional components, and are added at a fixed interval to the system in batches with random initial values on the xy-plane and zero velocity terms. A uniform random vector field is created with random x, y components and fixed, positive z component. Euler’s Forward Method is applied to the system to update each particle’s position and velocity. Any particles that escape from the unit bounding cube are removed from the system. This process produces the desired Brownian paths that are typical of smoke particles. To visualize each particle, the Metaballs algorithm is used to create a potential field about each particle. The three dimensional grid is populated in linear time with respect to the number of particles by iterating over a fixed volume about each particle since there is no need to go outside of the fixed volume where the point’s potential field is surely zero.
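The particle update and the potential field can be sketched in Python/numpy as follows. Several details are assumptions made for illustration: the function names, the treatment of the vector field as an acceleration sampled on a coarse grid, and the compact-support falloff (1 - d^2/R^2)^2, which stands in for whichever Metaball kernel was actually used.

```python
import numpy as np


def make_field(grid, rng, lift=1.0):
    """Random x, y components and a fixed positive z component per grid cell."""
    field = rng.uniform(-1.0, 1.0, size=(grid, grid, grid, 3))
    field[..., 2] = lift
    return field


def euler_step(pos, vel, field, dt):
    """One forward Euler step; particles leaving the unit cube are dropped."""
    grid = field.shape[0]
    cells = np.clip((pos * grid).astype(int), 0, grid - 1)
    accel = field[cells[:, 0], cells[:, 1], cells[:, 2]]
    new_pos = pos + dt * vel
    new_vel = vel + dt * accel
    inside = np.all((new_pos >= 0.0) & (new_pos <= 1.0), axis=1)
    return new_pos[inside], new_vel[inside]


def metaball_field(pos, grid, radius):
    """Sum a compact-support potential around each particle on a grid^3 lattice."""
    potential = np.zeros((grid, grid, grid))
    # Only lattice points within `reach` cells of a particle can be non-zero.
    reach = int(np.ceil(radius * (grid - 1))) + 1
    for p in pos:
        centre = np.clip(np.round(p * (grid - 1)).astype(int), 0, grid - 1)
        lo = np.maximum(centre - reach, 0)
        hi = np.minimum(centre + reach + 1, grid)
        idx = np.mgrid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
        coords = idx / (grid - 1)
        d2 = sum((coords[k] - p[k]) ** 2 for k in range(3))
        falloff = np.clip(1.0 - d2 / radius ** 2, 0.0, None) ** 2
        potential[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] += falloff
    return potential
```

Because each particle touches only a fixed-size window of the lattice, the cost of filling the grid grows linearly with the number of particles.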

marching
Fig. 9: Cube segmented into 6 tetrahedra, and the two primary cases of the Marching Tetrahedra algorithm.

The resulting scalar field from this process is passed along to the Marching Tetrahedra algorithm. The algorithm will inspect each volume of the grid in cubic time with respect to the grid edge size. The eight points of the volume are then assigned inside / outside labellings with those volumes completely inside or outside ignored. Those having mixed labellings contain a segment of the surface we wish to render. A single volume is segmented into 6 tetrahedra[1], with two tetrahedra facing each plane resulting in a common corner shared by all as shown above. Each tetrahedron then has sixteen cases to examine leading to different surfaces. Two of these cases are degenerate; all inside or all outside. The remaining fourteen cases can be reduced to two by symmetry as shown above. To ensure the surface is accurate, the surface vertices are found by linearly interpolating between inside / outside grid points.
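The two non-degenerate cases for a single tetrahedron can be sketched as below. This Python/numpy illustration uses hypothetical function names, treats values at or above the iso level as inside, and does not attempt consistent triangle winding, which a renderer would also need.

```python
import numpy as np


def interpolate(p0, p1, v0, v1, iso):
    """Point on the edge p0-p1 where the field crosses the iso level."""
    t = (iso - v0) / (v1 - v0)
    return p0 + t * (p1 - p0)


def polygonise_tetrahedron(points, values, iso):
    """Return the surface triangles inside one tetrahedron.

    points: (4, 3) array of corner positions; values: 4 scalar field samples.
    """
    inside = [i for i in range(4) if values[i] >= iso]
    outside = [i for i in range(4) if values[i] < iso]
    if not inside or not outside:
        return []  # degenerate cases: entirely inside or entirely outside

    def cut(i, j):
        return interpolate(points[i], points[j], values[i], values[j], iso)

    if len(inside) == 1 or len(outside) == 1:
        # One corner is separated from the other three: a single triangle.
        lone, rest = (inside[0], outside) if len(inside) == 1 else (outside[0], inside)
        return [np.array([cut(lone, j) for j in rest])]

    # Two corners on each side: a quadrilateral, split into two triangles.
    a, b = inside
    c, d = outside
    quad = [cut(a, c), cut(a, d), cut(b, d), cut(b, c)]
    return [np.array([quad[0], quad[1], quad[2]]),
            np.array([quad[0], quad[2], quad[3]])]
```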

Face normals can be computed directly from the resulting surface planes by taking the usual cross product. In order to calculate vertex normals, numerical differentiation is used to derive the gradient of the scalar field at each grid point using backward, central, and forward differences depending on availability. Taking the calculated normals at each grid point, the surface normals at the previously interpolated surface vertices are then the linear interpolation of the corresponding grid point normals. Given more time, I would have liked to invest in surface tracking and related data structures to reduce the cubic surface generation process down to just those volumes that require a surface to be drawn.
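The vertex normal calculation can be sketched in Python with NumPy, whose numpy.gradient uses central differences in the interior of the grid and one-sided differences at the boundaries, which matches the scheme described above. The function names and the uniform grid spacing are assumptions for illustration only.

```python
import numpy as np

def grid_normals(field, spacing=1.0):
    """Unit gradient of a 3D scalar field at every grid point."""
    gx, gy, gz = np.gradient(field, spacing)
    grad = np.stack([gx, gy, gz], axis=-1)
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Leave zero-gradient points as zero vectors instead of dividing by zero.
    return np.divide(grad, norm, out=np.zeros_like(grad), where=norm > 0)

def interpolate_normal(n0, n1, t):
    """Blend two grid point normals at parameter t along an edge."""
    n = (1.0 - t) * n0 + t * n1
    length = np.linalg.norm(n)
    return n / length if length > 0 else n
```

Whether the gradient or its negation points out of the surface depends on whether the field grows or shrinks toward the inside, so a sign flip may be needed.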

Butterflies

Fig. 10: Example series of movements and completed pursuit curve of predator and prey.

For the final stretch goal, the proposed static options were eschewed in favor of adding a dynamic element that would help bring the scene to life without being obtrusive. As a result, a kaleidoscope of butterflies that meanders through the scene was introduced.

Each butterfly follows a pursuit curve as it chases after an invisible particle following a random walk. Both butterfly and target are assigned positions drawn from a uniform random variable with unit support. At the beginning of a time step, the direction to where the particle will be at the next time step from where the butterfly is at the current time step is calculated, and the butterfly’s position is then incremented in that direction. Once the particle escapes the unit cube, or the butterfly catches the particle, then the particle is assigned a new position, and the game of cat and mouse continues.
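The chase can be sketched in a few lines of Python. The step sizes, the Gaussian random walk, and the catch radius below are assumed values chosen for illustration, and pursue is a hypothetical name, not part of the original program.

```python
import numpy as np

def pursue(rng, speed=0.02, walk=0.03, catch=0.02, steps=1000):
    """Trace a butterfly chasing a randomly walking target in the unit cube."""
    butterfly = rng.uniform(0.0, 1.0, 3)
    target = rng.uniform(0.0, 1.0, 3)
    path = [butterfly.copy()]
    for _ in range(steps):
        # Where the target will be at the next time step.
        next_target = target + rng.normal(0.0, walk, 3)
        # Head from the current butterfly position toward that future position.
        heading = next_target - butterfly
        dist = np.linalg.norm(heading)
        if dist > 0:
            butterfly = butterfly + min(speed, dist) * heading / dist
        target = next_target
        escaped = np.any((target < 0.0) | (target > 1.0))
        caught = np.linalg.norm(target - butterfly) <= catch
        if escaped or caught:
            target = rng.uniform(0.0, 1.0, 3)   # restart the game of cat and mouse
        path.append(butterfly.copy())
    return np.array(path)
```

Plotting the returned path gives curves of the kind shown in Fig. 10.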

Each time step the butterfly’s wings and body are rotated a slight amount by a fixed value, and by the Euler angles defined by its direction to give the correct appearance of flying. To add variety to the butterflies, each takes on one of three different appearances (Monarch, and Blue and White Morpho varieties) based on a fair die roll. One flaw with the butterflies is that they do not take into account the positions of other objects in the scene and can often be seen flying into the ground, or through the cabin.

Conclusions

This project discussed a large variety of topics related to introductory computer graphics, but did not cover other details that were developed including navigation, camera control, lighting, algorithms for constructing basic primitives, and the underlying design of the C++ program or implementation of the GLSL shaders. While most of the research applied to this project dates back nearly 35 years, the combination of techniques lends itself to a diverse and interesting virtual environment. Given more time, additional work could be done to expand the scene to include more procedurally generated plants, objects, and animals, and to make the existing elements look more photorealistic.

References

[Bli82] James F. Blinn. A generalization of algebraic surface drawing. ACM Trans. Graph., 1(3):235-256, 1982.

[FFC82] Alain Fournier, Donald S. Fussell, and Loren C. Carpenter. Computer rendering of stochastic models. Commun. ACM, 25(6):371-384, 1982.

[LC87] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1987, pages 163-169, 1987.

[Lin68] Aristid Lindenmayer. Mathematical models for cellular interactions in development I. Filaments with one-sided inputs. Journal of Theoretical Biology, 18(3):280-299, 1968.

[PT+90] Bradley Payne, Arthur W Toga, et al. Surface mapping brain function on 3d models. Computer Graphics and Applications, IEEE, 10(5):33-41, 1990.

Deep Learning for Automatic Speech Recognition


Introduction

The problem of automatic speech recognition and the traditional Hidden Markov Model and Gaussian Mixture Model hybrid architecture (HMM-GMM) for acoustic modeling are detailed in [JM08] and will be skipped here. Instead, the focus of this literature review is to discuss how [DYDA12] uses a context dependent Deep Neural Network and Hidden Markov Model hybrid architecture (CD-DNN-HMM) for acoustic modeling, as it represents a significant improvement over the traditional HMM-GMM approach. This review will begin with motivation for the architecture, then go into detail on the algorithms used for pre-training, and outline the algorithms used for training before concluding with how much the approach improves on the standard HMM-GMM approach.

Architecture

To motivate their architecture, [DYDA12] rely on the standard noisy channel model for speech recognition presented in [JM08] where we wish to maximize the likelihood of a decoded word sequence given our input audio observations:

\displaystyle \hat{w} = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( w \lvert x \right ) = \underset{w \in L}{\text{argmax }} \mathbb{P} \left( x \lvert w \right ) \mathbb{P} \left(  w  \right ) (1)

Where \mathbb{P} \left( w \right ) and \mathbb{P} \left( x \lvert w \right ) represent the language and acoustic models respectively. [JM08] state that the language model can be computed via an N-gram model; [DYDA12] acknowledge using this approach, but focus their efforts into explaining their acoustic model:

\displaystyle \mathbb{P} \left( x \lvert w \right ) = \sum_{q}  \mathbb{P} \left( x \lvert q, w \right ) \mathbb{P} \left( q \lvert w \right ) \approxeq \max \pi(q_0) \prod_{t = 1}^T a_{q_{t-1} q_t} \prod_{t=0}^T \mathbb{P} \left( x_t \lvert q_t \right ) (2)

Here the acoustic model is viewed as a sequence of transitions between states of tied-state triphones which [DYDA12] refer to as senones giving us the context dependent aspect of the architecture. [FLMS14] explains that senones represent the pronunciation of words and are derived by decision trees. By tying triphone states together, this approach is able to avoid having to process a large number of triphones and avoid the likely sparseness of training examples for every possible triphone.

The model assumes that there is a probability \pi(q_0) for the starting state, probabilities a_{q_{t-1} q_{t}} of transitioning from the state at step t - 1 to the state at step t, and finally, the probability of the acoustics given the current state q_t. [DYDA12] expand this last term further into:

\displaystyle \mathbb{P} \left( x_t \lvert q_t \right ) = \frac{\mathbb{P} \left( q_t \lvert x_t \right ) \mathbb{P} \left( x_t \right ) }{\mathbb{P} \left( q_t \right ) } (3)

Where \mathbb{P} \left( q_t \lvert x_t \right ) models the tied triphone senone posterior given mel-frequency cepstral coefficients (MFCCs) based on 11 sampled frames of audio. While MFCCs come from signal processing, they have proven to be effective features for automatic speech recognition. Based on the power spectrum derived from sample audio frames, MFCCs represent characteristics of the audio that our ears are sensitive to as explained in [Ada10]. \mathbb{P} \left( q_t \right ) is the prior probability of the senone, and \mathbb{P} \left( x_t \right ) can be ignored since it does not vary based on the decoded word sequence we are trying to find.

Based on this formalism, [DYDA12] chose to use a pre-trained Deep Neural Network to estimate \mathbb{P} \left( q_t \lvert x_t \right ) using MFCCs as DNN inputs and taking the senone posterior probabilities as DNN outputs. The transitioning between events is best modeled by a Hidden Markov Model whose notation, \pi, a, \text{and } q appears in Eq. (2). Now that we have an overview of the general CD-DNN-HMM architecture, we can look at how [DYDA12] train their model.
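To see how Eqs. (2) and (3) fit together, here is a small Python sketch of a decoder that scores state sequences with the maximum in Eq. (2), using DNN posteriors divided by senone priors in place of the acoustic likelihood. This is a generic log-domain Viterbi search written for illustration, not the decoder used by [DYDA12], and the function names are hypothetical.

```python
import numpy as np

def scaled_log_likelihood(log_posteriors, log_priors):
    """Eq. (3) with P(x_t) dropped: log P(q_t | x_t) - log P(q_t)."""
    return log_posteriors - log_priors

def viterbi(log_pi, log_a, log_b):
    """Most likely state path under Eq. (2).

    log_pi: (S,) start log-probabilities; log_a: (S, S) transition
    log-probabilities indexed [from, to]; log_b: (T, S) per-frame scores.
    """
    T, S = log_b.shape
    score = log_pi + log_b[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_a            # cand[i, j]: arrive in j from i
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_b[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))
```

In a full system the states would be senones and the search would also be constrained by the lexicon and language model, which this sketch leaves out.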

Pre-Training

Given the DNN model we wish to fit the parameters of the model to a training set. This is usually accomplished by minimizing a negative log likelihood function and deploying a gradient descent procedure to update the weights. One complication to this approach is that the likelihood can be computationally expensive for multilayer networks with many nodes, rendering the approach unusable. As an alternative, one can attempt to optimize a computationally tractable surrogate to the likelihood. In this case the surrogate is the contrastive divergence method developed by [Hin02]. This sidestep enabled [HOT06] to develop an efficient unsupervised greedy pre-training process whose results can then be refined using a few iterations of the traditional supervised backpropagation approach. In this portion of the paper we discuss the work of [Hin02] and explain the greedy algorithm of [HOT06] before going on to discuss the high-level training procedure of [DYDA12].

To understand the pre-training process, it is necessary to discuss the Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) models. An RBM is an undirected bipartite graphical model with Gaussian distributed input nodes in a visible layer connecting to binary nodes in a hidden layer. Every possible arrangement of hidden, h, and visible, v, nodes is given an energy under the RBM model:

\displaystyle E(v, h) = - b^T v - c^T h - v^T W h (4)

Where W is the weight of connections between nodes and vectors b and c correspond to the visible and hidden biases respectively. The resulting probability is then given by:

\displaystyle \mathbb{P} \left( v, h \right ) = \frac{e^{-E(v, h)}}{Z} (5)

Where Z is a normalization factor. Based on the assumptions of the RBM, [DYDA12] derive expressions for \mathbb{P} \left( h = 1 \lvert v \right ) and \mathbb{P} \left( v = 1 \lvert h \right ) given by:

\displaystyle \mathbb{P} \left( h = 1 \lvert v \right ) = \sigma(c + v^T W) \qquad \mathbb{P} \left( v = 1 \lvert h \right ) = \sigma(b + h^T W^T) (6)

Where \sigma is an element-wise logistic function. [DYDA12] argue that Eq. (6) allows one to repurpose the RBM parameters to initialize a neural network. Training of the RBM is done by stochastic gradient descent against the negative log likelihood since we wish to find a stable energy configuration for the model:

\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model} (7)

However, [DYDA12] point out that the gradient of the negative log likelihood cannot be computed exactly since the \langle \cdot \rangle_\text{model} term takes exponential time to compute. As a result, the contrastive divergence method is used to approximate the derivative:

\displaystyle - \frac{\partial \ell(\theta)}{\partial w_{ij}} \approx \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{1} (8)

where \langle \cdot \rangle_\text{1} is an expectation estimated from a single step of Gibbs sampling. These terms are the expected frequencies with which nodes i \text{ and } j are simultaneously active under the training data and under the model, respectively. Given this approximation, regular stochastic gradient descent can be performed and the parameters of an RBM fitted to training data.
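A single contrastive divergence update can be sketched in Python as follows. For brevity the sketch assumes binary visible units so that both conditionals in Eq. (6) apply directly (the Gaussian visible layer described earlier would need a different reconstruction step), and the learning rate and the name cd1_update are illustrative choices rather than anything taken from [DYDA12] or [Hin02].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step on a batch v0 of shape (n, n_visible).

    W has shape (n_visible, n_hidden); b and c are the visible and hidden biases.
    """
    ph0 = sigmoid(c + v0 @ W)                         # P(h = 1 | v0), Eq. (6)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(b + h0 @ W.T)                       # P(v = 1 | h0), Eq. (6)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)  # one-step reconstruction
    ph1 = sigmoid(c + v1 @ W)
    n = v0.shape[0]
    # <v_i h_j>_data - <v_i h_j>_1 from Eq. (8), averaged over the batch.
    grad_W = (v0.T @ ph0 - v1.T @ ph1) / n
    W = W + lr * grad_W
    b = b + lr * (v0 - v1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

Repeating this update over mini-batches of the training data is the stochastic gradient descent loop referred to above.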

Now that we have an understanding of RBMs, we can shift our focus to DBNs. A Deep Belief Network is a multilayer model with undirected connections between the top two layers and directed connections between the remaining layers. To train these models, [HOT06] had the insight to treat adjacent layers of nodes as RBMs. One starts with the bottom two layers and trains them as though they were a single RBM. Once those two layers are trained, the top layer of that RBM is treated as the input layer of a new RBM, with the layer above it acting as the hidden layer of the new RBM. This sliding window over the layers continues until the full DBN is trained. After this, [HOT06] describe an “up-down” algorithm to further refine the learned weights. The learned parameters of this greedy approach can then be used as the parameters of a DNN as explained earlier in the discussion of Eq. (6).

Training

Training of the CD-DNN-HMM model consists of roughly a dozen involved steps. We won’t elaborate here on the full details of each step, but will instead provide a high-level sketch of the procedure to convey its general mechanics.

The first high-level step of the procedure is to initialize the CD-DNN-HMM model. This is done by first training a decision tree to find the best tying of triphone states which are then used to train a CD-GMM-HMM system. Next, the unique tied state triphones are each assigned a unique senone identifier. This mapping will then be used to label each of the tied state triphones. (These identifiers will be used later to refine the DNN.) Finally, the trained CD-GMM-HMM is converted into a CD-DNN-HMM by retaining the triphone and senone structure and HMM parameters. This resulting DNN goes through the previously discussed pre-training procedure.

The next high-level step iteratively refines the CD-DNN-HMM. To do this, first the originally trained CD-GMM-HMM model is used to generate a raw alignment of states which is then mapped to its corresponding senone identifier. This resulting alignment is then used to refine the DBN by backpropagation. Next, the prior senone probability is estimated based on the number of frames paired with the senone and the total number of frames. These estimates are then used to refine the HMM transition probabilities to maximize the likelihood of the features. Finally, if these newly estimated parameters do not improve accuracy against a development set, then the training procedure terminates; otherwise, the procedure repeats this high-level step.

Experimental Results

System Configurations

[DYDA12] report that their system relies on a nationwide language model consisting of 1.5 million trigrams. For their acoustic model, they use a five hidden layer DNN with each layer containing 2,048 hidden units. Training the system from scratch on 24 hours of training data takes four days on a Dell T3500 workstation with an NVIDIA Tesla GPU. [DYDA12] emphasize the importance of the GPU in obtaining acceptable training time, and that without it, training would be 30x slower.

Datasets and Metrics

Comparison of automatic speech recognition systems relies on three principal error metrics: sentence (SER), word (WER), and phoneme (PER) error rates. These look at the ratio of incorrect entities to the total number of entities, with the exception of word error rate, which uses a Levenshtein approach to measure the number of insertions, substitutions, and deletions relative to the total number of words. A sentence is considered incorrect if there is at least one incorrect word.
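As a concrete example of the Levenshtein approach, the following Python sketch computes the word error rate of a hypothesis against a reference transcript. It assumes whitespace tokenization and a non-empty reference; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))    # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "the cat sat" and the hypothesis "the hat sat down" there is one substitution and one insertion, giving a WER of 2/3. The same sentence counts as a single error toward SER.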

These error metrics often coincide with different datasets: WER is reported for Switchboard, SER for Bing Mobile Voice Search (BMVS), and PER for TIMIT. Switchboard is a collection of phone conversations between two people; BMVS is a collection of short spoken queries such as “The Med” or “Chautauqua Park” that are used to find these locations; and TIMIT is a corpus of phonetically rich spoken sentences.

Results

        Switchboard    BMVS        TIMIT
        (WER)          (SER)       (PER)
GMM     23.6 [2]       36.2 [1]    21.7 [2]
DNN     16.1 [2]       30.4 [1]    21.9 [3]
CNN     -              -           20.2 [3]
RNN     -              -           17.7 [4]

Table 1: Comparison of different architectures on different datasets and their corresponding error metrics, as reported by the following sources: [1] [DYDA12], [2] [DBL12], [3] [AMJ+14], [4] [GMH13].

Direct comparison of models is complicated by the variety of error metrics and datasets; [DBL12] is used to fill in these gaps to make a meaningful comparison. As one can see from Table (1), the neural network approaches do better on average than the traditional GMM approach. To illustrate that it is not only DNN approaches that do better, the work of [AMJ+14] using a Convolutional Neural Network (CNN) and [GMH13] using a Recurrent Neural Network (RNN) are included to further drive home the point that neural network architectures are viable alternatives to GMMs.

Conclusions

[DYDA12], [AMJ+14], and [GMH13] have shown that neural network architectures exhibit better performance than Gaussian Mixture Models. [DYDA12] believe that a more capable first layer model provided by mean-covariance restricted Boltzmann machines will increase performance, while [AMJ+14] plan to investigate unexpected improvements in large-vocabulary speech recognition where they were absent in phone recognition tasks when using convolutional restricted Boltzmann machines. Both routes seem promising and are likely to produce improved error rates in line with [GMH13]’s results.

In [DBL12], the authors of both research groups suggest key gains will come from improved understanding of the pre-training process and how the types of units used in these models affect error rates. They conclude that distributed training is the largest hurdle to overcome for these systems to make use of more training data. (Parallelization is limited by the sequential stochastic gradient descent at the heart of the pre-training and training processes.) As [DYDA12] point out in their paper, GPU-based approaches can assist in reducing computation time, but more foundational approaches need to be pursued.

In a 2014 talk [Hin14], Hinton criticizes existing neural network architectures on philosophical grounds arguing that they do not correspond well enough to how the brain functions citing inadequate structural complexity. His proposed solution is a new neural network approach that clusters neurons together into capsules, which he believes will better model how the cortical columns of the brain behave. If Hinton is right (which his track record suggests), then it is likely we’ll see this capsule approach outperform existing models, and consequently, yield improved error rates in automatic speech recognition.

References

[Ada10] Andre Gustavo Adami. Automatic speech recognition: From the beginning to the portuguese language. In The Int. Conf. on Computational Processing of Portuguese (PROPOR). Rio Grande do Sul: Porto Alegre, 2010.

[AMJ+14] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533-1545, 2014.

[DBL12] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97, 2012.

[DYDA12] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 20(1):30-42, 2012.

[FLMS14] Luciana Ferrer, Yun Lei, Mitchell McLaren, and Nicolas Scheffer. Spoken language recognition based on senone posteriors. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 2150-2154. ISCA, 2014.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pages 6645-6649, 2013.

[Hin02] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[Hin14] Geoffrey E. Hinton. What’s wrong with convolutional nets? Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Fall Colloquium Series, 2014.

[HOT06] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.

Large-Scale Detection and Tracking of Aircraft from Satellite Images


Abstract
In this paper a distributed system for detecting and tracking commercial aircraft from medium resolution synthetic satellite images is presented. The system’s Apache Spark distributed architecture, correlation-based template matching, heuristic-based tracking, and an evaluation of its Standalone, Cluster, and Cloud modes of operation are covered before concluding with future work. Experimental results support 10 m detection accuracy, 96% detection rate, and 200 ms/image and 10 mb/sec Cloud processing performance. The goal of this work is to demonstrate that a satellite-based aircraft tracking system is feasible and that the system’s oversimplifying assumptions offer a baseline against which similar real-world systems may be compared. Applications of this system include real-time tracking of commercial aircraft, flight deviation, and the automatic discovery of commercial aircraft from historic imagery. To the best of the author’s knowledge, a system of this nature has not been published.

Introduction

Motivation

Historically, search and rescue operations have relied on on-site volunteers to expedite the recovery of missing aircraft. However, on-site volunteers are not always an option when the search area is very broad, or difficult to access or traverse as was the case when Malaysia Airlines Flight 370 disappeared over the Indian Ocean in March, 2014. Crowd-sourcing services such as tomnod.com have stepped up to the challenge by offering volunteers to manually inspect satellite images for missing aircraft in an attempt to expedite the recovery process. This process is effective, but slow. Given the urgency of these events, an automated image processing solution is needed.

Prior Work

The idea of detecting and tracking objects from satellite images is not new. There is plenty of academic literature on detection or tracking, but often not both for things like oil tankers, aircraft, and even clouds. Most distributed image processing literature is focused on using grid, cloud, or specialized hardware for processing large streams of image data using specialized software or general platforms like Hadoop. Commercially, companies like DigitalGlobe have lots of satellite data, but haven’t publicized their processing frameworks. Start-ups like Planet Labs have computing and satellite resources capable of processing and providing whole earth coverage, but have not publicized any work on this problem.

The FAA has mandated a more down to earth approach through ADS-B. By 2020, all aircraft flying above 10,000 ft will be required to have Automatic Dependent Surveillance-Broadcast (ADS-B) transceivers in order to broadcast their location, bearing, speed and identifying information so that they can be easily tracked. Adoption of the standard is increasing with sites like Flightradar24.com and Flightaware.org providing real-time access to ongoing flights.

Problem Statement

Fundamentally, there are two variants of this problem: online and offline. The offline variant most closely resembles the tomnod.com paradigm which is concerned with processing historic satellite images. The online variant most closely resembles air traffic control systems where we’d be processing a continual stream of images coming from a satellite constellation providing whole earth coverage. The focus of this work is on the online variant since it presents a series of more interesting problems (of which the offline problem can be seen as a sub-problem.)

In both cases, we’d need to be able to process large volumes of satellite images. One complication is that large volumes of satellite images are difficult and expensive to come by. To work around this limitation, synthetic images will be generated based off OpenFlights.org data with the intent of evaluating natural images on the system in the future. To generate data we’ll pretend there are enough satellites in orbit to provide whole earth coverage capturing simulated flights over a fixed region of space and window of time. The following video is an example of the approximately 60,000 flights in the dataset being simulated to completion:

To detect all these aircraft from a world-wide image, we’ll use correlation based template matching. There are many ways to parallelize and distribute this operation, but an intuitive distributed processing of image patches will be done with each cluster node performing a parallelized Fast Fourier Transform to identify any aircraft in a given patch. Tracking will be done using an online heuristic algorithm to “connect the dots” recovered from detection. At the end of the simulation, these trails of dots will be paired with simulated routes to evaluate how well the system did.

The remainder of this post will cover the architecture of the system based on Apache Spark, its configurations for running locally and on Amazon Web Services, and how well it performs before concluding with possible future work and cost analysis of what it would take to turn this into a real-world system.

System Architecture

Overview

Figure 1: Data pipeline architecture.

The system relies on the data pipeline architecture presented above. Data is read in from the OpenFlights.org dataset consisting of airport information and flight routes. A fixed number of national flights are selected and passed along to a simulation module. At each time step, the simulator identifies which airplanes to launch, updates the latitude and longitude coordinates of those in flight, and removes those that have arrived at their destination.

To minimize the amount of network traffic being exchanged between nodes, flights are placed into buckets based on their current latitude and longitude. Buckets having flights are then processed in parallel by the Spark Workers. Each worker receives a bucket and generates a synthetic satellite image; this image is then given to the detection module which in turn recovers the coordinates from the image.

These coordinates are coalesced at the Spark Driver and passed along to the tracking module where the coordinates are appended to previously grown flight trails. Once 24 hours worth of simulated time has elapsed (in simulated 15 minute increments), the resulting tracking information is passed along to a reporting module which matches the simulated routes with the flight trails. These results are then visually inspected for quality.

Simulation Assumptions

All latitude and longitude calculations are done under the Equirectangular projection. A corresponding flight exists for each route in the OpenFlights.org dataset (Open Database License). Flights departing hourly follow a straight line trajectory between destinations. Once en route, flights are assumed to be Boeing 747s traveling at an altitude of 35,000 ft with a cruising speed of 575 mph.

Generation

Figure 2: Conceptual layering and data representations between the modules of the pipeline. World silhouette by Wikimedia Commons.

Flights are mapped to one of 4000 \times 8000 buckets based on each flight’s latitude and longitude coordinate. Each bucket spans a 0.045 \times 0.045 degree region as illustrated in the middle layer of Fig. (2). Given a bucket, a 512 \times 512 pixel oversimplified synthetic medium-resolution monochromatic satellite image is created with adorning aircraft silhouettes for each 71 \times 65 m Boeing 747 airliner in the bucket. (Visual obstructions such as clouds or nightfall will not be depicted.) This image, in addition to the latitude and longitude of the top-left and bottom-right of the image, is then passed along to the detection module.
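The bucketing step amounts to a small amount of arithmetic, sketched here in Python. The row and column orientation (row 0 at the north pole, column 0 at the antimeridian) and the function names are assumptions made for this example; the original system may index its buckets differently.

```python
BUCKET_DEG = 0.045          # bucket edge length in degrees
ROWS, COLS = 4000, 8000     # 180 / 0.045 rows and 360 / 0.045 columns

def bucket_of(lat, lon):
    """Map a latitude and longitude to a (row, column) bucket index."""
    row = min(int((90.0 - lat) / BUCKET_DEG), ROWS - 1)
    col = min(int((lon + 180.0) / BUCKET_DEG), COLS - 1)
    return row, col

def bucket_bounds(row, col):
    """Return the (lat, lon) of a bucket's top-left and bottom-right corners."""
    top = 90.0 - row * BUCKET_DEG
    left = -180.0 + col * BUCKET_DEG
    return (top, left), (top - BUCKET_DEG, left + BUCKET_DEG)
```

Since one degree of latitude is roughly 111 km, a 0.045 degree bucket rendered at 512 pixels works out to a little under 10 m per pixel, which is in line with the detection accuracy quoted in the abstract.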

Detection

Given an image and its world coordinate frame, the detection module performs textbook Fourier-based correlation template matching to identify silhouettes of airplanes, X, in the image, Y:

\displaystyle Z = \mathcal{F}^{-} \left( \mathcal{F}(X) \circ \mathcal{F}(Y) \right ) (1)

Where the two-dimensional Discrete Fourier Transform and inverse transform are defined as:

\displaystyle \mathcal{F}(f)(u,v) = \frac{1}{M N} \sum_{m = 0}^{M - 1} \sum_{n = 0}^{N - 1} f(m, n) \exp{ \left( -2 \pi i \left( \frac{mu}{M} + \frac{nv}{N} \right ) \right ) } (2)

\displaystyle \mathcal{F^{-}}(F)(m,n) = \sum_{u = 0}^{M - 1} \sum_{v = 0}^{N - 1} F(u, v) \exp{ \left( 2 \pi i \left( \frac{mu}{M} + \frac{nv}{N} \right ) \right ) } (3)

To carry out these calculations efficiently, a parallelized two-dimensional Fast Fourier Transform (FFT) algorithm running in \mathcal{O}(N \log N) time was implemented for both forward and inverse operations by leveraging the fact that operations (2), (3) can be factored so that the FFT can be computed row-wise, and then column-wise on those results. The Hadamard (element-wise) product of the frequency domain representations of the airplane and satellite image is computed naively in quadratic time.

To denoise the results, the recovered spatial product, Z, is thresholded so that any real values greater than 90% of the product’s real maximum, Z^*, are kept:

\displaystyle Z_{x,y} = \Re{ \left( Z_{x, y} \right ) } 1_{ [ \Re{ \left( Z_{x, y} \right ) } \ge 0.9 Z^{*}  ] } (4)
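The detection step can be sketched with NumPy’s FFT routines standing in for the custom parallel FFT. Two caveats: the sketch conjugates the template’s spectrum, which is the standard way to obtain cross-correlation rather than convolution and is an assumption relative to Eq. (1) as written, and NumPy applies the 1 / MN factor in the inverse transform rather than the forward one, which only rescales Z and does not affect the relative threshold of Eq. (4). The function name is hypothetical.

```python
import numpy as np

def match_template(image, template, keep=0.9):
    """Correlate a template with an image and keep only the strongest responses."""
    padded = np.zeros_like(image, dtype=float)
    h, w = template.shape
    padded[:h, :w] = template                   # zero-pad template to image size
    # Conjugating the template spectrum yields circular cross-correlation.
    Z = np.fft.ifft2(np.conj(np.fft.fft2(padded)) * np.fft.fft2(image))
    real = Z.real
    # Eq. (4): keep real values at or above 90% of the real maximum.
    return np.where(real >= keep * real.max(), real, 0.0)
```

With this convention, a peak at pixel (r, c) means the template’s top-left corner aligns with (r, c) in the image.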

Since there are many values greater than the threshold, a linear time (in number of nodes) connected component labeling algorithm is applied to identify the most likely aircraft locations. The algorithm treats each pixel of the image as a node if that pixel’s value is greater than the threshold. An edge exists between two nodes if the nodes’ pixel coordinates are within an \ell_\infty distance of two. The centroid of each connected component is then taken to be the true coordinate of each detected aircraft. Next only those centroids derived from clusters having more than half the average number of pixels per cluster are kept. Finally, these centroids are transformed to latitude and longitude coordinates given the world coordinate frame.
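The labeling and filtering steps can be sketched in Python with a breadth-first search over the surviving pixels. It assumes the input has already been thresholded so that rejected pixels are zero; detect_centroids and its reach parameter are names invented for this example.

```python
from collections import deque
import numpy as np

def detect_centroids(Z, reach=2):
    """Group nonzero pixels within an l-infinity distance of `reach` and
    return the centroids of the sufficiently large groups."""
    pixels = {tuple(int(i) for i in p) for p in np.argwhere(Z > 0)}
    clusters = []
    while pixels:
        seed = pixels.pop()
        queue, members = deque([seed]), [seed]
        while queue:
            r, c = queue.popleft()
            for dr in range(-reach, reach + 1):
                for dc in range(-reach, reach + 1):
                    q = (r + dr, c + dc)
                    if q in pixels:             # unvisited neighbor
                        pixels.remove(q)
                        queue.append(q)
                        members.append(q)
        clusters.append(members)
    if not clusters:
        return []
    # Keep clusters with more than half the average cluster size.
    min_size = 0.5 * sum(len(m) for m in clusters) / len(clusters)
    return [tuple(np.mean(m, axis=0)) for m in clusters if len(m) > min_size]
```

Each pixel is visited once and inspects a constant number of neighbors, so the running time is linear in the number of nodes. The returned centroids are in pixel coordinates and would still need to be mapped to latitude and longitude using the image’s corner coordinates.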

Tracking

The tracking module uses a 181 \times 361 grid of buckets with each bucket representing approximately a square degree region as illustrated in the top layer of Fig. (2). Each individual bucket consists of a stack of sightings where a sighting is a timestamped collection of coordinates. Here an individual coordinate is allowed to “connect” up to two other coordinates. Coordinates connected in this fashion form a trail, which is the primary output of the module.

Figure 3: (Left) Collinear and (Right) directional tracking heuristics. Blue points C represent coordinates that would be accepted under these heuristics and red points C^\prime that would be rejected.

For each latitude and longitude coordinate from the detection module, d \in D, the tracking module picks all the previous time step’s coordinates, p \in P, from the neighboring (\ell_\infty \le 5) buckets of d’s bucket. From P, only those coordinates that satisfy the following criteria are considered:

  • p must be free to “connect” to another coordinate.
  • d must be collinear to the coordinates of the trail headed by p, i.e., \lvert AC - AB - BC \rvert \le \epsilon as in Fig. (3).
  • Given the predecessor of p, the inner product of the vectors formed from the predecessor to p and d must be positive, i.e., \langle \overrightarrow{AB}, \overrightarrow{AC} \rangle > 0 as in Fig. (3).

Next, the nearest neighbor of d is chosen from this remaining set of points. If a nearest neighbor exists, then d is appended to the end of the nearest neighbor’s trail; otherwise, a new trail is created. Finally, d is added to its designated bucket so that it can be used for future trail building.
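The two heuristics and the nearest neighbor choice can be sketched in Python. The sketch takes A to be the predecessor of a trail head, B the trail head itself, and C the new detection, which is my reading of Fig. (3); the tolerance value, the planar distance, and all function names are assumptions, and the bucket lookup and the "free to connect" bookkeeping are left out.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def is_collinear(a, b, c, eps=1e-3):
    """|AC - AB - BC| <= eps, i.e. B lies on the segment from A to C."""
    return abs(dist(a, c) - dist(a, b) - dist(b, c)) <= eps

def is_ahead(a, b, c):
    """<AB, AC> > 0, i.e. C continues in the direction the trail is heading."""
    ab = (b[0] - a[0], b[1] - a[1])
    ac = (c[0] - a[0], c[1] - a[1])
    return ab[0] * ac[0] + ab[1] * ac[1] > 0

def pick_predecessor(d, heads):
    """Choose the nearest acceptable trail head for a new detection d.

    heads: (a, b) pairs where b is a trail head and a is its predecessor,
    or None when the trail has only one point so far.
    """
    ok = [(a, b) for a, b in heads
          if a is None or (is_collinear(a, b, d) and is_ahead(a, b, d))]
    return min(ok, key=lambda ab: dist(ab[1], d), default=None)
```

A return value of None corresponds to the case where a new trail is started.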

When the simulation completes, all trails from the tracking module are analyzed and matched to the known routes used in the simulation. Matching is done by minimizing the distance from the trail’s origin and destination to a route’s origin and destination respectively. If the mean orthogonal distance:

\displaystyle MOD(x) = \frac{1}{N} \sum_{i} \frac{ \lvert \langle w, x_i \rangle + b \rvert  }{ \lVert w \rVert  } (5)

from the coordinates in the trail to the line formed by connecting the route’s origin and destination is greater than 25 m, then the match is rejected.
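Eq. (5) can be evaluated directly once the route is written as a line \langle w, x \rangle + b = 0. The Python sketch below builds w as a normal to the route’s direction; it assumes planar coordinates in consistent units (for example metres after projection), and the function name is hypothetical.

```python
import numpy as np

def mean_orthogonal_distance(points, origin, destination):
    """Average perpendicular distance from trail points to the route line."""
    p, o, d = (np.asarray(v, dtype=float) for v in (points, origin, destination))
    direction = d - o
    w = np.array([-direction[1], direction[0]])     # normal to the route line
    b = -w @ o                                      # so that <w, o> + b = 0
    return float(np.mean(np.abs(p @ w + b) / np.linalg.norm(w)))
```

For a route from (0, 0) to (10, 0) and trail points (1, 1), (5, -3), and (9, 0), the distances are 1, 3, and 0, giving a mean of 4/3.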

Reporting

The reporting module is responsible for summarizing the system’s performance. It reports the average mean orthogonal distance given by Eqn. (5) over all identified routes, the total number of images processed and coordinates detected, and the portion of routes correctly matched.

System Configurations

Standalone

Figure 4: Standalone configuration components.

Standalone mode runs the application in a single JVM without using Spark. Experiments were run on the quad-core Intel i7 3630QM laptop jaws, which has 8 GB of memory and a 500 GB hard drive, and is running Windows 8.1 with Java SE 7.

Cluster

Figure 5: Cluster configuration components.

Cluster mode runs the application on a Spark cluster in standalone mode. Experiments were run on a network consisting of two laptop computers connected to a private 802.11n wireless network. In addition to jaws, the laptop oddjob was used. oddjob is a quad-core Intel i7 2630QM laptop with 6 GB of memory and a 500 GB hard drive, running Windows 7. Atop each machine, Oracle VM VirtualBox hosts two cloned Ubuntu 14.04 guest operating systems. Each virtual machine has two cores, 2 GB of memory and an 8 GB hard drive. Each virtual machine connects to the network using a network adapter bridged to its host’s. Host and guest operating systems are running Java SE 7, Apache Hadoop 2.6, and Scala 2.11.6 as prerequisites for Apache Spark 1.3.1. In total, there are four Spark Workers that report to a single Spark Master, which is driven by a single Spark Driver.

Cloud

Figure 6: Cloud configuration components.

Cloud mode runs the application on an Amazon Web Services (AWS) provisioned Spark cluster managed by Apache Yarn. Experiments were run using AWS’s Elastic Map Reduce (EMR) platform to provision the maximum allowable twenty[1] Elastic Compute Cloud (EC2) previous generation m1.medium instances (one master, nineteen core) by scheduling jobs to execute the application JARs from the Simple Storage Service (S3). Each m1.medium instance consists of one 2 GHz CPU, 3.7 GB of memory, and a 3.9 GB hard drive, running Amazon Machine Image (AMI) 3.6 equipped with Red Hat 4.8, Java SE 7, and Spark 1.3.0. In total, there are nineteen Spark Workers and one Spark Master (one per virtual machine) managed by a Yarn Resource Manager driven by a single Yarn Client hosting the application.

System Evaluation

Detection Rate

Figure 7: \overline{D}_R reported for flights in a single 25 km² region. Larger \overline{D}_R is better.

\displaystyle D_R = \frac{\# \left( \text{Detected coordinates} \right)}{\# \left( \text{Expected coordinates} \right)} (6)

When an image is sparsely populated, the system consistently detects the presence of the aircraft; however, as the density increases, the system is less adept at finding all flights in the region, as shown in Fig. (7). This is unexpected behavior. Explanations include the possibility that the threshold needs to be made adaptive, or that a different approach needs to be taken altogether. In terms of real-world implications, FAA regulations (JO 7110.65V 5-4-4) state that flights must maintain a minimum lateral distance of between 3 and 5 miles (4.8 to 8 km). In principle, there could be at most four flights in a given image under these guidelines and the system would still have a 96.6% chance of identifying all four positions.

Detection Accuracy

Figure 8: Detected and actual positions of a flight from Denver to Charlotte.

A flight from Denver (DEN) to Charlotte (CLT) was simulated to measure how accurately the template matching approach works, as illustrated in Fig. (8). Two errors were measured: the mean orthogonal distance given by Eqn. (5) and the mean distance between the detected and actual coordinate for all time steps:

\displaystyle MD(x, y) = \frac{1}{N} \sum_i \lVert  x_i - y_i \rVert_2 (7)

For Eqn. (5) a mean error of 9.99 \pm 3.2 m was found, and for Eqn. (7) 19.95 \pm 3.3 m. Both errors are acceptable given a single pixel represents approximately 10 m. (For context, the global positioning system (GPS) has a 7.8 m error.)
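For comparison with Eqn. (5), Eqn. (7) can be sketched as follows. The example assumes both sequences are planar coordinates in meters and are already aligned by time step; the function name is hypothetical.

```python
import math

def mean_distance(detected, actual):
    """Eqn. (7): average Euclidean distance between paired coordinates."""
    if len(detected) != len(actual):
        raise ValueError("sequences must have one coordinate per time step")
    total = sum(math.hypot(x[0] - y[0], x[1] - y[1])
                for x, y in zip(detected, actual))
    return total / len(detected)

# A detection that is exact at the first step and 5 m off at the second.
md = mean_distance([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Unlike the orthogonal distance, this measure also penalizes a detection that lies on the route but ahead of or behind the actual position, which is one plausible reason the error for Eqn. (7) is the larger of the two.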

Figure 9: 65% of the paired trails and routes have a MOD less than 26 m. Smaller error is better.

In terms of how accurate detection is at the macro level, 500 flights were simulated and the resulting mean orthogonal distance was analyzed. Fig. (9) illustrates the bimodal distribution that was observed. 65% of the flights had an error of less than 26 m with an average error of 14.2 \pm 5.8 m, while the remaining 35% saw an average error of 111.9 \pm 100 km, which is effectively being off by a full degree. It is assumed that this 35% are cases where trails are paired incorrectly with routes. Based on these findings, the system enforces that all pairings not exceed a mean orthogonal distance of 25 m.

Tracking Rate

Figure 10: 2^n for n \in [0, 10] national flights were simulated to completion with their mean tracking rate reported over 15 trials. Larger \overline{T}_R is better.

\displaystyle T_R = \frac{\# \left( \text{Correctly paired trails} \right)}{\# \left(\text{ Expected trails} \right)} (8)

For Fig. (10), an increasing number of random flights were simulated to completion and the resulting mean tracking rate reported. Based on these findings, the tracking module is having difficulty correctly handling many concurrent flights originating from different airports. This behavior is likely a byproduct of how quickly the detection rate degrades when many flights occupy a single region of space. When running a similar simulation where all flights originate from the same airport, the tracking rate is consistently near-perfect independent of the number of flights. This would suggest the system has difficulty with flights that cross paths. (For context, there are on average 7,000 concurrent flights over US airspace at any given time that we would wish to track.)

Performance

Figure 11: Average processing time per image in milliseconds for the three different configurations. Smaller ms/image is better.

A series of experiments was conducted against the three configurations to measure how quickly the system could process different volumes of flights across the United States over a 24-hour period. The results are illustrated in Fig. (11). Unsurprisingly, the Cloud mode outperforms both the Standalone and Cluster modes by a considerable factor as the number of flights increases.

| Configuration | ms/image | MB/s | Time (min) |
| --- | --- | --- | --- |
| Standalone | 704 | 3.00 | 260 |
| Cluster | 670 | 2.84 | 222 |
| Cloud | 207 | 9.67 | 76 |

Table 1: Processing time per image, megabytes of image data processed per second, and overall processing time for 22k images by system configuration.

Table (1) lists the overall processing time for 22k images representing roughly 550k km² and 43 GB of image data. If the Cloud configuration were used to monitor the entire United States, then it would need approximately 22 hours to process a single snapshot consisting of 770 GB of image data. Obviously, the processing time is inadequate to keep up with a recurring avalanche of data every fifteen minutes. To do so, a real-world system would need to be capable of processing an image every 2 ms. To achieve this 1) more instances could be added, 2) the implementation can be refined to be more efficient, 3) the implementation can leverage GPUs for detection, and 4) a custom tailored alternative to Spark could be used.
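The 22 hour and 2 ms figures can be reproduced with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted above and assumes 1 GB = 1000 MB and that every image is the same size; the variable names are illustrative.

```python
# Figures quoted in the text and in Table (1).
images = 22_000                 # images in the benchmark
benchmark_gb = 43.0             # image data in the benchmark
cloud_mb_per_s = 9.67           # measured Cloud throughput
cloud_ms_per_image = 207.0      # measured Cloud time per image
us_snapshot_gb = 770.0          # one snapshot of the United States
refresh_minutes = 15            # a new snapshot every fifteen minutes

# Time for the Cloud configuration to process one national snapshot.
snapshot_hours = us_snapshot_gb * 1000 / cloud_mb_per_s / 3600

# Images per national snapshot, scaled from the benchmark (assumption:
# uniform image size), and the time budget each one gets.
snapshot_images = images * us_snapshot_gb / benchmark_gb
budget_ms = refresh_minutes * 60 * 1000 / snapshot_images

# How much faster than the measured Cloud rate that budget is.
required_speedup = cloud_ms_per_image / budget_ms

print(f"{snapshot_hours:.1f} h per snapshot")
print(f"{budget_ms:.2f} ms per image budget")
print(f"{required_speedup:.0f}x speedup required")
```

Under these assumptions the budget comes out at a little over 2 ms per image, which is roughly two orders of magnitude faster than what the Cloud configuration measured.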

Discussion

Future Work

There are many opportunities to exchange the underlying modules with more robust techniques that both scale and are able to handle real-world satellite images. The intake and generation modules can be adapted to either generate more realistic flight paths and resulting satellite imagery, or adapted to handle real-world satellite imagery from a vendor such as Skybox Imaging, Planet Labs, or DigitalGlobe.

For detection, the correlation based approach can be replaced with a cross-correlation approach, or with the more involved Scale-Invariant Feature Transform (SIFT) method, which would be more robust at handling aircraft of different sizes and orientations. Alternatively, the parallelism granularity can be changed so that the two-dimensional FFT row-wise and column-wise operations are distributed over the cluster, permitting the processing of larger images.

Tracking remains an open issue for this system. Getting the detection rate to be near perfect will go a long way, but the age of historical sightings considered could be increased to account for “gaps” in the detection trail. Yilmaz et al. provide an exhaustive survey of deterministic and statistical point tracking methods that can be applied here, in particular the Joint Probabilistic Data Association Filter (JPDAF) and Multiple Hypothesis Tracking (MHT) methods, which are worth exploring further.

On the reporting end of the system, a dashboard showing all of the detected trails and coordinates would provide an accessible user interface to end-users to view and analyze flight trails, discover last known locations, and detect anomalies.

Real-world Feasibility

While the scope of this work has focused on system internals, it is important to recognize that a real-world system requires a supporting infrastructure of satellites, ground stations, computing resources, facilities and staff, each of which imposes its own set of limitations on the system. To evaluate the system’s feasibility, its expected cost is compared to the expected cost of the ADS-B approach.

Following the CubeSat model and a 1970 study by J. G. Walker, 25 satellites ($1M ea.) forming a constellation in low earth orbit are needed to provide continuous whole earth coverage for $25M. Ground stations ($120k ea.) can communicate with one satellite at a time, bringing total costs to $50M.[2] Assuming that a single computer is responsible for a one square degree region, the system will require 64,800 virtual machines, equivalently 1,440 quad-core servers ($1k ea.), bringing the running total to $51M.

ADS-B costs are handled by aircraft owners. Average upgrade costs are $5k with prices varying by vendor and aircraft. Airports already have Universal Access Transceivers (UATs) to receive the ADS-B signals. FAA statistics list approximately 200,000 registered aircraft suggesting total cost of $1B.

Given that these are very rough estimates, an unobtrusive $51M system would be a good alternative to a $1B exchange between private owners and ADS-B vendors. (Operational costs of the system were estimated to be $1.7M/year based on market rates for co-locations and staff salaries.)

Conclusions

In this work, a distributed system has been presented to detect and track commercial aircraft from synthetic satellite imagery. The system’s accuracy and detection rates are acceptable given established technologies. Given suitable hardware resources, it would be an effective tool in helping search-and-rescue teams locate airplanes from historic satellite images. More work needs to be done to improve the system’s tracking abilities for it to be a viable real-world air traffic control system. While not implemented, the data needed to support flight deviation, flight collision detection and other air traffic control functionality is readily available. Spark is an effective tool for quickly distributing work to a cluster, but more specialized high performance computing approaches may yield better runtime performance.

References

[1] Automatic dependent surveillance-broadcast (ADS-B) out equipment and use. Technical Report 14 CFR 91.225, U.S. Department of Transportation Federal Aviation Administration, May 2010.

[2] General aviation and air taxi activity survey. Technical Report Table 1.2, U.S. Department of Transportation Federal Aviation Administration, Sep 2014.

[3] Nanosats are go! The Economist, June 7 2014.

[4] A. Eisenberg. Microsatellites: What big eyes they have. New York Times, August 10 2013.

[5] A. Fasih. Machine vision. Lecture given at Transportation Informatics Group, ALPEN-ADRIA University of Klagenfurt, 2009.

[6] S. Kang, K. Kim, and K. Lee. Tablet application for satellite image processing on cloud computing platform. In Geoscience and Remote Sensing Symposium (IGARSS), 2013 IEEE International, pages 1710-1712, July 2013.

[7] W. Lee, Y. Choi, K. Shon, and J. Kim. Fast distributed and parallel pre-processing on massive satellite data using grid computing. Journal of Central South University, 21(10):3850-3855, 2014.

[8] J. Lewis. Fast normalized cross-correlation. In Vision interface, volume 10, pages 120-123, 1995.

[9] W. Li, S. Xiang, H. Wang, and C. Pan. Robust airplane detection in satellite images. In Image Processing (ICIP), 2011 18th IEEE International conference on, pages 2821-2824, Sept 2011.

[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91-110, 2004.

[11] H. Prajapati and S. Vij. Analytical study of parallel and distributed image processing. In Image Information Processing (ICIIP), 2011 International Conference on, pages 1-6, Nov 2011.

[12] E. L. Ray. Air traffic organization policy. Technical Report Order JO 7110.65V, U.S. Department of Transportation Federal Aviation Administration, April 2014.

[13] J. Tunaley. Algorithms for ship detection and tracking using satellite imagery. In Geoscience and Remote Sensing Symposium, 2004. IGARSS ’04. Proceedings. 2004 IEEE International, volume 3, pages 1804-1807, Sept 2004.

[14] J. G. Walker. Circular orbit patterns providing continuous whole earth coverage. Technical report, DTIC Document, 1970.

[15] Y. Yan and L. Huang. Large-scale image processing research cloud. In CLOUD COMPUTING 2014, The Fifth International Conference on Cloud Computing, GRIDs, and Virtualization, pages 88-93, 2014.

[16] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.

Graduate School


“Stuck. What it is, I think there is a jump some people have to make, sometimes, and if they don’t do it, then they’re stuck good … And Rudy never did it.”
“Like my father wanting to get me out of Maas? Is that a jump?”
“No. Some jumps you have to decide on for yourself. Just figure there’s something better waiting for you somewhere….” He paused, feeling suddenly ridiculous, and bit into the sandwich.

William Gibson, Count Zero [1]

After spending a fair amount of time in industry, I’ve decided it’s fair time to change directions, to make a jump. This month I’ll be returning to school to work on my master’s degree in Computer Science. Based on what makes me happy, and where I’d like to take my career, it’s the right path and the right time to make this change. I’m looking forward to making the most of it.

As far as what it means for the site, it’s tough to say. August 8th was the six year anniversary of the site and after writing and publishing my interests for so long, it makes sense to continue to do so, but on the other hand, time is precious. So we’ll see. As always, take a look through the archive, follow by email, or by RSS for the latest happenings.

Written by lewellen

2014-08-18 at 2:34 pm

Posted in Announcements

3D Printed Toy Robot


Left-to-right: Original pixel art illustration, nearest neighbor interpolation, vectorized version and 3D CAD model.


Introduction

At the beginning of the year I decided to change my approach to working on personal projects; in short, more time on quality and less on quantity. With that mindset I thought about my interests and how they could come together to form an interdisciplinary project that would be ambitious, but doable. After much thought, I zeroed in on mechatronics -the interplay of form, computation, electronics and mechanics- and began to explore how I could draw from these disciplines to form the foundation of my next project.

Like most people coming from a software background, much of what I work on is abstract in nature. Every once in a while I have a strong desire to work on something tangible. Lacking the proper space and tools to work with traditional mediums like wood and metal, I thought about newer mediums I’d read about and directed my attention to 3D printing. Despite being 30 years old, the technology has only become accessible to consumers within the past seven years with the launch of several online services and open source printers. Having watched this growth, I was curious about the technology and decided it would be worth exploring in this project.

I’d been looking for an excuse to get back into electronics and work on a project that would require me to delve deeper into analog circuit design. This desire was drawn from the belief that I ought to be able to reason about computation regardless of how it is represented- be it by mathematics, software or hardware. This of course meant avoiding microcontrollers; my goal here was to better learn the foundations of the discipline, not make something quickly and cheaply.

To stay true to the mechatronics concept, I decided I would incorporate mechanical elements into the project. Having zero knowledge of mechanics, I knew whatever I’d make would be simplistic and not much to write about. However, I felt that whatever I decided to do, I wanted the mechanically driven functionality to ignite a sense of fun in whoever was interacting with the end result. After all, the end result is something you might only interact with briefly, but it should be memorable and what better way to achieve that than to make it entertaining.

I searched for inspiration to figure out how I would tie all of these disciplines together. Growing up I read a lot of Sci-fi literature and a lot of that material stuck with me over the years, especially the portrayal of robots from the golden age. Above all, it captured the sense that anything was possible with technology and served as a great source of inspiration in my career. So without much surprise, I found the inspiration I needed for the project in my robot avatar. My mind was flickering with ideas as to how I could bring that simple icon to life and with that excitement I began my work.

Over the course of eight months I taught myself the necessary disciplines and skills to carry out my vision of building a simple 3D printed toy robot. Having finished my work at the end of July, I began compiling my notes and documenting my work and concluded this writing in the middle of October. This post is the culmination of that work and it will cover my technical work, thinking and experience. Above all, this is my story of bringing an idea to fruition while digging into the world of traditional engineering.

Outline

The scope of the project is fairly broad, so naturally, there’s a lot that I’m going to cover in this post. To help set some expectations on what you’ll encounter, I’m going to cover the three main stages of the project consisting of planning, building and evaluating; each covering the technical aspects related to industrial design, electronics and mechanics. In the planning section you can expect to read about the requirements, project plan and design of the product. In the building section, I’m going to cover the process of sourcing materials, prototyping and building the finished product. I’ll be wrapping up with a section dedicated to how the product was tested and some reflections on how I would approach the project again given the experience.

Related Work

While I was in the midst of doing some brainstorming last December I came across a series of YouTube videos by Jaimie Mantzel covering his experience building 3D printed toy hexapods. His “3D printed big robot” series really captured his enthusiasm and energy for the medium and while I wasn’t going to be making something as complicated as a hexapod, it was helpful to get a window into someone else’s mind and see how they approach the problem of making a product from scratch. Hopefully this document will inspire others to get out there and make something as much as these videos inspired me to do the same.

Planning Phase

Requirements

I designed my avatar about ten years ago and in reflecting on how I illustrated it at the time I began to think about what it would do in the real world. For starters, I envisioned an illuminated eye and chest opening with the latter glowing on and off. The second thing I envisioned was that the arms would be able to swing back and forth opposite to one another giving the illusion of body language depending on the initial position of the arms: straight down giving the impression of marching, forward as though it were in a zombie state, all the way up as though in a panic and flung backward as though it were running around frantically. In short, a variety of different personas.

Implicit to these behaviors are some underlying requirements. The toy requires power, so it makes sense to provide an on-off button somewhere on the design that is easy to access while the arms are in motion. Since the arms are in motion, it raises the question of how long the arms swing in each direction and how quickly. Finally there is the frequency at which the chest lights pulsate on and off and with what type of waveform. The physical enclosure would be required to enclose all of the electronics and hardware components and also provide a means of assembling the parts within the enclosure and a mechanism for securely closing the enclosure.

Since we are talking about engineering it makes sense to have a handful of engineering centric requirements. From an electronics engineering point of view, the components must be within their stated current and voltage tolerances as the power supply diminishes over time. From a mechanical point of view, the motor must supply sufficient torque to the mechanical system to ensure efficient transfer of motion to the arms so that the movement remains fluid, plus the product would need to be stable under static and dynamic loads. Finally, from an industrial design point of view, the product itself must be sturdy enough to withstand the forces that will be applied to it and be designed in a way that the fabrication process will yield a quality result.

Ultimately, we are talking about making a consumer product. That product needs to be polished in its appearance and stay true to the original avatar illustration. The product needs to be conscious of its life cycle as well in terms of being made from environmentally friendly components and practices, to being efficient in its use of energy and being capable of being recycled at the end of its life. As is the case with any project, there are a number of more nuanced requirements imposed by each of the components and tools used in the process of creating the product. I will discuss along the way how these constraints affected the project.

Methodology

Thinking about how to approach the project, I thought about how I’d approached projects of similar size and scope. My approach typically follows the iterative and incremental development methodology; do a lot of up front planning, cycle through some iterations and wrap up with a deployment. Rinse and repeat.

Center: Project activities and outputs (bold) for each stage consisting of initial planning (green), development iteration (blue), and deployment (maroon) stages.


In the introduction I talked about my thought process and motivation that constituted the brainstorming that was done to generate the requirements. From these requirements, I broke the problem down into smaller, more digestible components that I could research, learning the governing models that dictate how those components behave both mathematically and in computer simulations. With that understanding, it was possible to put together a design that fulfills a specific set of conditions that result in the desired set of outputs. This design is the cornerstone of the initial planning.

The bulk of work is based in working through iterations that ultimately result in working components based on the design that was produced in the planning stage. Each iteration corresponds to a single component/requirement that starts off with building a prototype. For electronics this is done by prototyping on a breadboard; for mechanics and industrial design, with cardboard elements. The result is a functioning prototype that is used to create a refined design for the component that is then built, for example, as a soldered protoboard or 3D printed volume. A completed component is tested and if accepted, stored for assembly.

As accepted components are completed, they are assembled into the broader view of the product. Each component is tested as part of the whole and if working correctly, fastened into place. Once all of the components are incorporated, the end product is tested and ultimately accepted. Throughout the process, if a particular output for an activity fails, then the previous activity is revisited until the root cause is identified, corrected and the process moves forward to completion.

Design

Industrial Design

Part Design

Knowing I didn’t know what I didn’t know about how to make a durable product, I did some research about part design and came across a set of design guidelines from Bayer MaterialScience covering the type of features that are added to products to make them more resilient to the kind of abuse consumer products are subjected to. This was helpful since it gave me a set of primitives to work with and in this project I ended up using a few of these features for providing strength to the product (ribs, gussets and trusses) and a few for allowing parts to be mated together (counterbores, taps and bosses).

Left-to-right: Part design features used in the design of the product consisting of ribs, gussets, counterbores and bosses.


Ribs are used whenever there is a need to reduce the thickness of a plate and retain the plate’s original strength. When the thickness of the plate is reduced, so is the volume, and hence the cost, but so too is the plate’s strength. To compensate, a rib is placed along the center of the plate and this increases the plate’s strength to its original state. A similar approach is deployed when talking about gussets as a way of providing added strength between two orthogonal faces. While not depicted above, the last technique deployed was the truss. Like what you’d find in the frame of your roof, the truss is a series of slats that form a triangular mesh that redistributes forces placed on the frame, allowing it to take on greater load.

Counterbores are used so that the top of a machine screw sits flush with or below the surface that needs to be fastened to another. The machine screws are then fastened to either a tap or a boss. A tap is a bored out section of material that is then threaded to act like a nut. Since we want to minimize material, the alternative is a boss: the material around the bore is carved out, leaving a standing post that the machine screw will fasten to. Since the boss is freestanding, it may have gussets placed around its perimeter depending on the boss’s height.

Modeling

The first step in getting the 3D model put together was to use the pixel art version of my avatar to create a vectorized version for the purpose of developing a system of units and measurements that I could then use to develop a 3D model of the product in OpenSCAD. After having a chance to learn the software and its limitations, I set out to create a series of conventions to make it easy to work with parts, assemblies and enclosures in a variety of units and scales. Since real world elements would factor into the design of the product, I included all of the miscellaneous parts into the CAD model as well to make it easier to reason about the end result. After about a month of work the model was completed. Let’s take a look at how it all came together.


The product consists of two arms, body, bracket and back plate. Each arm consists of a rounded shoulder with openings for fastening shaft collars, the shaft collars themselves and the shaft that extends from the body. The extended arm is hollow with interior ribs to add strength to the printed result. To support the printing process, small holes are added on the underside so that supporting material can be blown out.

On the exterior face of the back plate, the embossed designer logo, copyright and production number are provided to identify the product. Counterbores allow machine screws to be fastened to the main body at the base of the legs and back of the head. On the interior face, the electronic components are fastened to bosses using machine screws. Since the back plate is just a thin plate, ribs were added to increase that plate’s strength.

The body has an opening on the top of the head for the on-off switch, and the face an opening for the eye plate and on the chest an opening for the chest plate. Counterbores line the hips so that machine screws can be fastened to the bracket to hold it in place. On the interior, horizontal bosses mate with counterbores on the back plate and several trusses and ribs were added to provide strength.

Left-to-right: Symbols and details consisting of the resin identification code (RIC) for polyamide 2200, waste electrical and electronic equipment (WEEE) to promote recycling, my personal logo with production number and identifying information.


A shelf separates the eye cavity from the chest cavity so that illumination from one doesn’t interfere with the other. Under the interior of the shoulders are the RIC for the material, WEEE logo, and along the leg, text describing the product with title, designer, date and copyright.

Finally, the bracket consists of two flanges with openings for ball bearings and a main shelf with an opening for the motor and set screws to secure the motor. The bracket has four taps that mate with the counterbores along the hip of the body.

Analysis

While I had a set of guidelines on how to make the product durable, most of the modeling process was more art than engineering. Since the 3D printing process would be fairly expensive, I wanted some additional confidence that the result was going to come out sturdy, so I thought a bit about it from a physics point of view. I knew of stress-strain analysis and that to understand how a complex object responds to loads I would need to use a Finite Element Analysis (FEA) solver on the 3D model. I researched the subject a bit, and once I felt I had a working understanding, I decided to use CalculiX to perform the actual computations. This section will cover the background, process and results at a very high level.

Stress-Strain Analysis

When a physical body is put under load, the particles of the body exert forces upon one another; the extent of these internal forces is quantified by stress. When the load begins to increase beyond the material’s ability to cope, the particles may begin to displace; how much they move about is quantified by strain.

Center: Stress-strain curve.


For small amounts of strain the material will behave elastically; returning to its original form once any applied forces are removed. As the amount of strain increases, and consequently the stress past the yield stress, the material begins to behave plastically; first hardening and then beginning to neck (i.e., thinning and separating) until it finally fractures. The specific critical values will depend entirely on the material and in this project an isotropic material, Polyamide PA 2200 (Nylon 12), will be used in the 3D printing process.

For a durable product, we obviously want to avoid subjecting the body to any forces that will result in permanent deformation which means keeping stress below the yield stress and even lower from an engineering point of view. The acceptable upper bound is given by a safety factor that linearly reduces the yield stress to an acceptable level. According to the literature a safety factor in the range N \in [1.1, 1.5] is appropriate.

To establish a design criterion, I went with a safety factor of N = 1.5. The material datasheet omitted the precise yield value, but upon further research, the average yield stress for Nylon 12 was reported to be \sigma_Y = 33 \text{ MPa}. Thus, an upper bound of \sigma_{N} = 22 \text{ MPa} will be used as an acceptable level of stress, which is on the order of about a five hundred pound object resting on a square centimeter of area. While that may make the idea of performing the full analysis overkill, it’s still valuable from an intuition building point of view.
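To make the arithmetic behind that bound concrete, here is a small Python check of the allowable stress and the “five hundred pound” comparison. It uses only the values quoted above plus standard unit conversions; the variable names are just for illustration.

```python
# Values quoted in the text.
yield_stress_mpa = 33.0     # average yield stress reported for Nylon 12
safety_factor = 1.5         # N, the upper end of the suggested range

# The safety factor linearly reduces the yield stress.
allowable_mpa = yield_stress_mpa / safety_factor

# Force that produces the allowable stress on one square centimeter.
area_m2 = 1e-4                              # 1 cm^2 in m^2
force_n = allowable_mpa * 1e6 * area_m2     # Pa * m^2 = N

# Weight of the object that would exert that force under gravity.
mass_kg = force_n / 9.80665                 # standard gravity
weight_lb = mass_kg / 0.45359237            # kilograms per pound

print(f"allowable stress: {allowable_mpa:.1f} MPa")
print(f"equivalent load:  {weight_lb:.0f} lb on 1 cm^2")
```

The load comes out just under five hundred pounds, which agrees with the rough comparison above.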

Linear Stress-Strain Relationship

Given that we want to stay within the linear region, we can now begin to look at the specific relationship between stress, \sigma, and strain, \varepsilon, for a three dimensional point which is given by a system of partial differential equations subject to equilibrium and compatibility conditions, fixed (Dirichlet) and load bearing (Neumann) boundary conditions:

\sigma = C \varepsilon


Both stress and strain are second-order tensors that describe how each of the three bases of a coordinate system relates to the others. When acting in the same dimension, the result is normal stress, \sigma_{xx}, and normal strain, \varepsilon_{xx}; otherwise the result is shear stress, \tau_{xy}, and (engineering) shear strain, \gamma_{xy}. The strain terms are partial derivatives of the displacement field, u(\vec{x}), to be determined and stress terms are constants. C is the elasticity matrix that relates the two.

Since we are working with an isotropic material, the resulting equations can be simplified to the following:

\displaystyle \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix} = \frac{1}{E} \begin{pmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{pmatrix} \begin{pmatrix} \sigma_{xx} \\ \sigma_{yy} \\ \sigma_{zz} \end{pmatrix}

\displaystyle \gamma_{xy} = \frac{\tau_{xy}}{G} \quad \gamma_{yz} = \frac{\tau_{yz}}{G} \quad \gamma_{zx} = \frac{\tau_{zx}}{G}


| Characteristic | Description | Polyamide PA 2200 Value |
| --- | --- | --- |
| Young’s modulus | Slope of the stress-strain curve in the linear region. | E = 1.7 \text{ GPa} |
| Poisson’s ratio | Measure of how much a material will contract laterally as a consequence of being stretched. | \nu = 0.4 |
| Shear modulus | Function of the previous two that quantifies how much force is needed before the material begins to shear. | G = \frac{E}{2(1+\nu)} = 0.61 \text{ GPa} |
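To make the isotropic relationship concrete, here is a minimal sketch that evaluates the equations above for the material constants listed in the table. The function names and the uniaxial 22 MPa load case are illustrative assumptions, not part of the original analysis.

```python
E = 1.7e9   # Young's modulus in Pa (from the table)
NU = 0.4    # Poisson's ratio (from the table)

def shear_modulus(e, nu):
    """G = E / (2 (1 + nu)) for an isotropic material."""
    return e / (2.0 * (1.0 + nu))

def normal_strains(sxx, syy, szz, e=E, nu=NU):
    """Normal strains from normal stresses using the 3x3 isotropic relation."""
    return (
        (sxx - nu * (syy + szz)) / e,
        (syy - nu * (sxx + szz)) / e,
        (szz - nu * (sxx + syy)) / e,
    )

def shear_strain(tau, e=E, nu=NU):
    """Engineering shear strain, gamma = tau / G."""
    return tau / shear_modulus(e, nu)

G = shear_modulus(E, NU)
# Hypothetical load case: 22 MPa pulling along x only.
exx, eyy, ezz = normal_strains(22e6, 0.0, 0.0)

print(f"G = {G / 1e9:.2f} GPa")
print(f"strains: {exx:.5f}, {eyy:.5f}, {ezz:.5f}")
```

Under a purely uniaxial load the two transverse strains are equal and negative, each being the axial strain scaled by Poisson’s ratio.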


Finite Element Method (FEM)

To determine the stress and strain of a body under load we’ll rely on the FEM, a general purpose numerical method for calculating the approximate solution to a boundary value problem whose analytic solution is elusive. The general idea is that the problem domain can be broken into smaller elements, each analyzed individually, and the corresponding analyses combined so that an overarching solution can be determined.

Center: Example domain with Dirichlet and Neumann conditions converted into a mesh consisting of linear triangular elements, Dirichlet nodes and Neumann edges.
triangulation


In the discretization step, the exact elements depend on the nature of the problem being solved; in the case of stress analysis, the body is broken into (linear or quadratic) tetrahedral elements consisting of nodes, edges and faces. In general, the nodes are subjected to (unknown) internal and (known) external forces and the displacements of the nodes are to be determined. The relationship between the two is assumed to be linear. Thus, the following equation characterizes the state of the physical model at the element.

\vec{f} =\textbf{k} \vec{u}


Here \textbf{k} is the stiffness matrix, \vec{u} is the displacement vector and \vec{f} is the force vector (also known as the load vector). Fixed boundary conditions are defined in the displacement vector and load boundary conditions in the force vector when necessary. After defining the relationship at each element, the assembly process aggregates each statement into the global stiffness matrix which represents the state of the full physical model.

\vec{F} = \textbf{K} \vec{U}


Standard algorithms such as Gaussian Elimination and the Gauss-Seidel Method can be used to solve for the displacement vector. With the displacement vector, stress and strain can then be computed at the nodes and interpolated across each element to produce the desired stress analysis. As the number of elements increases, the error between the exact and approximate solution will decrease. Because a displacement-based discretization is typically stiffer than the real body, the FEM solution tends to serve as a lower bound to the actual solution.
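The assemble-then-solve idea can be sketched on a toy problem far simpler than the tetrahedral meshes used here: a one dimensional bar made of two spring-like elements, fixed at one end and pulled at the other. Everything in this block (the function names, the stiffness value, the load) is a hypothetical illustration and has nothing to do with how CalculiX is implemented.

```python
def assemble(num_nodes, elements):
    """Sum each two-node element's stiffness k * [[1, -1], [-1, 1]] into K."""
    K = [[0.0] * num_nodes for _ in range(num_nodes)]
    for a, b, k in elements:
        K[a][a] += k
        K[b][b] += k
        K[a][b] -= k
        K[b][a] -= k
    return K

def gaussian_elimination(A, b):
    """Solve A x = b using Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x

def solve_bar(num_nodes, elements, fixed, loads):
    """Solve F = K U with zero displacement at the fixed (Dirichlet) nodes."""
    K = assemble(num_nodes, elements)
    free = [i for i in range(num_nodes) if i not in fixed]
    K_free = [[K[i][j] for j in free] for i in free]
    F_free = [loads.get(i, 0.0) for i in free]
    U = [0.0] * num_nodes
    for i, u in zip(free, gaussian_elimination(K_free, F_free)):
        U[i] = u
    return U

k = 1000.0  # N/m, hypothetical element stiffness
U = solve_bar(3, [(0, 1, k), (1, 2, k)], fixed={0}, loads={2: 10.0})
print(U)
```

With node 0 fixed and a 10 N load on node 2, each element stretches by F / k, so the middle node moves 0.01 m and the end node 0.02 m.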

Tool Chain
Left-to-right: Interchange formats between applications.
fea-workflow


Given the theory, it’s time to assemble the tools to carry out the job. The abundance of free computer aided design (CAD) and engineering (CAE) software made it easy to narrow down the options to OpenSCAD to define the geometry, MeshLab to clean the geometry, Netgen to mesh and define boundary conditions, and finally CalculiX to perform the necessary calculations and visualize the analysis. While each tool is excellent at carrying out its technical tasks, they are less polished from a usability point of view so additional time was spent digging under the hood to deduce expected behavior. Given the extra time involved, it’s worth covering the steps taken to obtain the end results.

The geometry of the product was modeled in OpenSCAD using its built-in scripting language. To share the geometry with Netgen, it is exported as a Standard Tessellation Language (STL) file which consists of a lattice of triangular faces and their corresponding normal vectors. Sometimes I found the exported STL file from OpenSCAD would cause Netgen to have a hard time interpreting the results, so MeshLab was used to pre-process the STL file and then hand it off to Netgen.

Netgen is used to load the geometry, generate the mesh and define boundary conditions. Once the geometry is loaded and verified with the built-in STL doctor, the engine is configured to generate a mesh and the process (based on the Delaunay algorithm) is carried out. Once the resulting mesh is available, the faces that correspond to the boundary conditions are assigned identifiers so they can be easily identified in CalculiX. The end result is exported as a neutral volume file that CalculiX will be able to work with.

CalculiX consists of two components: GraphiX (CGX) and CrunchiX (CCX). CGX is used as a pre-processor to export the mesh in a format (MSH) that CCX can easily interpret, and to specify and export the boundary conditions consisting of the fixed surfaces (NAM) and load conditions (DLO). CCX takes a handwritten INP file that relates the material properties, mesh, and boundary conditions to the type of analysis to perform, carries out that analysis, and outputs a series of files, most notably the FRD file. CGX is then used as a post-processor to visualize the resulting stress and deformation results.

Finite Element Analysis (FEA)

The completed product will have pressure applied to the top of the head whenever the pushbutton is pushed to turn the product on or off; based on the design criteria, a uniform static pressure of 22 \text{ MPa} will be applied across the top of the head for the load boundary condition. It will be assumed that the product is sitting on a surface such that the bases of the feet are fixed producing the remaining boundary conditions.

Six configurations were run in order to evaluate the designs consisting of three mesh granularities against the model before and after strengthening enhancements were introduced. For the purpose of the analysis, it is assumed that the eye, chest and back plate are fixed to the body.

Left-to-right: Displacement before and after, and Von Mises stress before and after for the “Very Fine” granularity.
correct-fea


Looking at the plots, the maximum displacement is centered about the through hole for the pushbutton which is consistent with one’s intuition. It appears that stress is most concentrated along the front where the head meets the shoulders and along the rim of the through hole for the pushbutton which matches up with the literature. After enhancements were introduced, it appears that displacement was reduced and that stress became more concentrated. Quantitatively:

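| Enhancement | Mesh Size | Stress Min (MPa) | Stress Max (MPa) | Displacement Min (mm) | Displacement Max (mm) |
| --- | --- | --- | --- | --- | --- |
| Before | Moderate | 0.0166 | 1.27 | 0 | 0.519 |
| Before | Fine | 0.0109 | 1.83 | 0 | 0.888 |
| Before | Very Fine | 0.00151 | 5.99 | 0 | 2.01 |
| After | Moderate | 0.000648 | 2.5 | 0 | 0.631 |
| After | Fine | 0.000348 | 5.06 | 0 | 1.11 |
| After | Very Fine | 0.000148 | 11.2 | 0 | 1.75 |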


After collecting the extrema from the test cases, the stress results seemed spurious to me. It didn’t feel intuitively right that applying a large amount of pressure to the head would result in at most half of that pressure appearing throughout the body. I am assuming that there was a mix-up in the order of magnitude of the units somewhere along the way. If the calculations are correct, then everything will be well below the yield stress and the product will be able to support what seems like a lot of stress. On the other hand, if the numbers are wrong, then I can’t make any claims other than relative improvements in the before and after, which was the main reason for carrying out the analysis.

From the literature, a finer mesh (i.e., a greater number of nodes) gives more accurate results. Looking at the outcome of the “Very Fine” mesh shows that the maximum stress in the product after enhancements is greater than the maximum stress before changes. However, the product is more rigid after the enhancement and exhibited a reduced maximum displacement. This seems like an acceptable tradeoff since the goal was to make a product sturdier through enhancements.

Mechanics

Construction
Left-to-right: Technical drawing of the front and side views of the mechanical system. Actual gears will have an involute profile.
mechanical-drawing


The next step in completing the design of the product was to focus on how to make a machine to rotate the arms. Not knowing a whole lot, the first step was to read up on the established elements and to think how they could be used to fulfill my needs. Out of this reading I came across the differential and how it was configured to rotate shafts at different speeds. Deconstructing the mechanism down to its primitive components enabled me to see how I could borrow from that design to come up with my own.

Gearmotor

At the heart of the mechanical system is a 100:1 (gear ratio) gearmotor consisting of a direct current (DC) motor and a gear train. For every rotation of the output shaft, the motor’s shaft rotates 100 times. This is achieved through the arrangement of gears in the gear train attached to the DC motor’s shaft. Since output torque scales with the gear ratio (neglecting losses), the gearmotor trades speed for a greater mechanical advantage.

Gear Train

The gearmotor is used to drive a gear train consisting of three gears: a pinion gear attached to the shaft of the gearmotor and two bevel gears that are perpendicular to the pinion. As the pinion gear rotates clockwise (resp. counterclockwise), one of the bevel gears will rotate clockwise (resp. counterclockwise) and the other will rotate counterclockwise (resp. clockwise). The pinion gear and bevel gears that will be used are in ratio of 1.3:1.

Shaft Assembly

To fix elements to the shaft, shaft collars are used to pinch the element into place along the shaft and each shaft collar is then fastened to the shaft using set screws to retain the position and inward pressure on the element. Each bevel gear is connected to a shaft that extends outward to an arm. The main elements fastened using the sandwiching technique are the bevel gears, radial ball bearings that mesh with the bracket and body, and the arm. Outside the body, a spacer is used to displace the arm from the body and openings exist in the shoulder of the arm to access the shaft collars used to secure the arm.

Bracket

A modified clevis bracket will be used to hold the mechanical elements in place. The bracket consists of an opening in the middle to house the gearmotor along with two set screw openings to fasten it in place. To allow the bracket to be fitted with the body, holes for machine screws are tapped along the edges of the bracket. Along the flanges of the bracket are openings for the radial ball bearings, and each flange is supported by gussets to ensure that it doesn’t easily break off.

Analysis

Since we want to control the output of the arm as much as possible, we’ll want to explore how the arms behave at different orientations. To do so, we’ll assume that no external forces (with the exception of gravity) are acting on the arm and that the motor we are using is a permanent magnet, direct current motor with a gearbox, specifically the KM-12FN20-100-06120 from the Shenzhen Kinmore Motor Co., Ltd. There are three views of the model that we’ll take into account: the arm, and the motor’s physical and electrical characteristics. Each view of the model will be presented, and then used to answer two questions: (1) are there any orientations of the arm that cannot be supported by the motor, and (2) are there any orientations that the motor cannot accelerate up to in order to achieve a steady angular velocity?

Arm
Center: Free body diagram of the arm. The innermost circle is an opening for the shaft and the next innermost circle is an opening for the shaft collar.
physics


While the geometry of the arm is more complex than a standard geometric primitive, it will be modeled as a cuboid with a square edge length of H = 1 \text{ in} and a projected length of L = 3.5 \text{ in}. The wall thickness, wt = 3/32 \text{ in}, gives a total approximate volume of V = 0.704 \text{ in}^3. According to the material’s datasheet, the density of the printed material is \delta = 0.93 \text{ g}/\text{cm}^3 giving us an approximate mass of m = 11.5 \text{ g}.

Since we are talking about rotating the arm, the moment of inertia will come into play. We’ll use the standard formula for a cuboid, superposition principle to account for the hollow interior and parallel axis theorem to deal with the pivot about the elbow in order to come up with the actual moment of inertia for the model, M_{\text{Arm}} = 214 \text{ g cm}^2.

Physics

We have a series of torques being applied to the arm: first from the motor itself, \tau_m, the viscous friction, \tau_f, and finally from gravity, \tau_g.

The torque from the motor, \tau_m = K_m I, is proportional to the current, I, that is applied with respect to the motor’s torque constant, K_m. The torque constant can be calculated by taking the quotient of the stall torque and stall current. For the motor that will be used, that value comes out to be K_m = 0.069 \text{ Nm/A}.

The viscous friction, \tau_f = K_f \omega, is proportional to the speed at which the motor is rotating and the motor’s viscous friction constant, K_f. The datasheet doesn’t include this value, so a near zero value was chosen based on properties of similar motors.

Torque due to gravity, \tau_g = \lVert F_g(\theta) \times \Delta(\theta) \rVert, is a function of the force due to gravity, F_g(\theta) and the orientation of the arm, \theta. Assuming that gravity and friction are acting in opposition of the motor, the net torque is the sum of these torques giving:

\displaystyle M_{\text{Arm}} \frac{d^2}{dt^2} \theta = - m g \Delta \sin(\theta) - K_f \frac{d}{dt} \theta + K_m I


Electronics

The ideal electronic model of the motor system is a series circuit consisting of the voltage, V, applied to the terminals of the motor, the motor’s internal resistance, R, due to the coils, the resulting inductance, L, and the back electromotive force (EMF), V_b, generated by the motor.

Center: Electronic model of an ideal DC motor. The node between the voltage and resistor will indicate the reference node in all electrical schematics.
motor-model


By Kirchhoff’s voltage law, the voltage applied to the terminals is equal to the sum of the potentials across each of the components giving:

\displaystyle V = L\frac{d}{dt}I + R I + K_b \frac{d}{dt} \theta


Of the three constants in the equation, L, R and K_b, only the motor voltage constant, K_b, is known, since it is the same as the motor torque constant, K_m.

Since the resistance and inductance are not listed on the datasheet for the motor and could not be located online, we’d have to purchase the motor and then experimentally determine their values. To complete the analysis, we’ll make simplifying assumptions to work around this limitation.

Completed Model

Rewriting the physical and electronic governing equations in terms of the angular velocity, \omega, we end up with a system of inhomogeneous linear ordinary differential equations which can be solved using the technique of variation of parameters.

\displaystyle \frac{d}{dt}\underbrace{\begin{pmatrix} I \\ \omega \end{pmatrix}}_{\vec{x}} = \underbrace{\begin{pmatrix} -\frac{R}{L} & -\frac{K_b}{L} \\ \frac{K_m}{M_{\text{Arm}}} & -\frac{K_f}{M_{\text{Arm}}} \end{pmatrix}}_{\textbf{A}} \begin{pmatrix} I \\ \omega \end{pmatrix} + \underbrace{\begin{pmatrix} \frac{V}{L} \\ -\frac{m g \Delta }{M_{\text{Arm}}} \end{pmatrix}}_{\vec{b}} \to \frac{d}{dt} \vec{x} = \textbf{A} \vec{x} + \vec{b}


In doing so, we’d uncover nonlinear transient behavior and steady state fixed values that both current and angular velocity will approach in the limiting case.

Now that we have a complete model, we’ll make some simplifying assumptions in order to resolve questions (1) and (2). First, we’ll assume that the viscous friction constant is several orders of magnitude smaller than the other terms and can be set to zero. Second, we’ll assume that the motor has accelerated up to a fixed angular velocity. Finally, we’ll assume the motor will always be supporting the worst case load from the arm. Under those assumptions we arrive at:

\displaystyle \begin{pmatrix} V \\ m g \Delta \end{pmatrix} = \begin{pmatrix} R & K_b \\ K_m & 0 \end{pmatrix} \begin{pmatrix} I \\ \omega \end{pmatrix}


To approach question (2), let’s think about the current side of the system. The maximum current takes place when the motor shaft is not rotating. Using an ohmmeter to determine the terminal resistance, R = 20.6 \Omega, the maximum current works out to be I_{\text{Max}} = 291 \text{ mA} which is higher than the rated current of I_{\text{Rated}} = 135 \text{ mA} but less than the stall current of I_{\text{Stall}} = 420 \text{ mA}. From a steady state point of view the motor will operate within the specified bounds.

However, when we go to reverse the motor, we’ll introduce a drop from the steady state speed down to zero and then ramp back up in the opposite direction. So in reality, we may observe current changes on the order of twice that of I_{\text{Max}}, which will push us outside the stated boundaries, so we’ll need to ensure that we operate at a voltage no higher than V_{\text{Max}} = 4.32 \text{ V}. Provided we keep the voltage below 72\% of the rated voltage of 6 \text{ V}, sufficient current will be supplied to the motor and it will rotate up to a constant angular velocity and thus obtain any desired orientation.
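The current and voltage limits quoted above can be reproduced with a few lines of arithmetic. This sketch assumes, following the reasoning in the text, that a reversal can demand up to twice the locked-rotor current and that this swing must stay at or below the stall current; the variable names are illustrative.

```python
R_TERMINAL = 20.6   # ohms, terminal resistance measured with an ohmmeter
V_RATED = 6.0       # volts, rated voltage
I_RATED = 0.135     # amps, rated current
I_STALL = 0.420     # amps, stall current

# Maximum current occurs when the shaft is not rotating (no back EMF).
i_max = V_RATED / R_TERMINAL

# Assumed reversal criterion: 2 * (V / R) <= I_STALL, solved for V.
v_max = I_STALL * R_TERMINAL / 2.0
fraction_of_rated = v_max / V_RATED

print(f"I_max = {i_max * 1000:.0f} mA")
print(f"V_max = {v_max:.2f} V ({fraction_of_rated:.0%} of rated)")
```

The maximum current falls between the rated and stall currents, and the voltage ceiling works out to roughly 72% of the rated voltage.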

Electronics

The final step in completing the design was to focus on how the electronics would illuminate the eye and chest cavity and drive the rotation of the arms. Being the main focus of the project, I will be covering how the electronics fulfill these requirements in a bit more depth than the two prior sections.

Motor Control

The motor control subsystem is responsible for managing how frequently and how quickly the motor needs to rotate in each direction. The subsystem consists of a timing circuit determining how frequently the motor will rotate, a pulse width modulated circuit to determine how quickly, a Boolean logic circuit to form composite signals that will be fed to a motor driver, and an H-Bridge circuit serving as the motor driver used to drive the gearmotor.

Series Resistor-Capacitor (RC) Circuits

To understand the basis for timing we’ll need to discuss series resistor-capacitor circuits briefly. Assume a series circuit consisting of a voltage source, V_{\text{cc}} (Volts), resistor with resistance R (Ohms), and a capacitor with capacitance C (Farads). There are two scenarios to consider, the first in which the capacitor is fully discharged and second when it is fully charged.

rc-circuit-charging

In the first, the capacitor will begin to charge, allowing some of the current to pass through, and then, once fully charged, it will have such impedance that no current will flow through. Kirchhoff’s voltage law says that the sum of the voltages across the resistor and capacitor is equal to the supply voltage.

V_{cc} = V_R(t) + V_C(t)


Since there is one path for current to flow through the entire circuit, the amount of current flowing through the resistor is the same as the capacitor. Using Ohm’s law and the definition of capacitance and current we arrive at an initial value problem consisting of a linear ordinary differential equation that can be solved using the technique of integrating factors.

\displaystyle \frac{V_{cc}}{ RC} = \frac{d}{dt}V_C(t) + \frac{1}{RC}V_C(t)


Solving the equation we get a curve that grows with exponential decay and, in the limiting case, approaches V_{cc}. For mathematical simplicity, let \tau = 1/(RC) be the timing constant.

\displaystyle V_{C}(t) = V_{cc} \left ( 1 - e^{-\tau t} \right )


rc-circuit-discharging

In the case of the discharging capacitor, Kirchhoff’s voltage law says the voltage of the capacitor is equal in magnitude to the voltage across the resistor.

V_C(t) = -V_R(t)


Making the same assumptions as before, we arrive at another initial value problem consisting of a linear ordinary differential equation that can be solved using its characteristic equation.

\displaystyle 0 = \frac{d}{dt}V_C(t) + \frac{V_C(t)}{RC}


As a result we get a curve that declines with exponential decay that will eventually approach zero in the limiting case. Using the timing constant again, we arrive at the following curve.

\displaystyle V_C(t) = V_{cc} e^{-\tau t}
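The two curves can be evaluated directly, and inverting the charging curve gives the time needed to move between two voltage levels, which is where the \ln(3) and \ln(2) factors in 555 timing expressions come from. This is a small sketch with hypothetical component values; the function names are not from the original post.

```python
import math

def charging_voltage(t, vcc, r, c):
    """Capacitor voltage t seconds after charging begins from 0 V."""
    return vcc * (1.0 - math.exp(-t / (r * c)))

def discharging_voltage(t, vcc, r, c):
    """Capacitor voltage t seconds after discharging begins from vcc."""
    return vcc * math.exp(-t / (r * c))

def charge_time(v_start, v_end, vcc, r, c):
    """Time for a charging capacitor to rise from v_start to v_end (< vcc)."""
    return r * c * math.log((vcc - v_start) / (vcc - v_end))

# Hypothetical values for illustration only.
VCC, R, C = 5.0, 4.7e3, 100e-6

after_one_rc = charging_voltage(R * C, VCC, R, C) / VCC
ramp_up = charge_time(0.0, 2.0 * VCC / 3.0, VCC, R, C)         # ln(3) R C
recharge = charge_time(VCC / 3.0, 2.0 * VCC / 3.0, VCC, R, C)  # ln(2) R C

print(f"fraction of Vcc after one RC: {after_one_rc:.3f}")
print(f"0 to 2/3 Vcc: {ramp_up:.3f} s, 1/3 to 2/3 Vcc: {recharge:.3f} s")
```

After one product RC the capacitor has reached about 63% of the supply, and the charging and discharging curves always sum to V_{cc} at any given time.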


Timing
Center: 50% Duty Cycle 555 based Circuit. Source: Adapted from National Semiconductor LM555 datasheet Page 10, Figure 14.
555


In order to tell the motor how long it needs to be on in one direction, a 555 integrated circuit is used to generate a square waveform of a fixed frequency. This is done by chaining an RC circuit with the inputs of the integrated circuit in astable mode to generate a 50% duty cycle waveform. The duty cycle is a way of measuring what percent of each period the resulting square wave spends in a high state. Here, we’ll have equal parts high state and low state each period.

Center: Illustration of the 555’s output as the capacitor voltage charges and discharges after hitting trigger and threshold values.
555-output


When an input voltage of one third that of the reference is applied to the trigger input, the output of the integrated circuit will be the same as the reference voltage. When two thirds the reference voltage is applied to the threshold input, the output voltage goes to ground. In this set up, the capacitor in the circuit will be continuously charging until the ceiling has been hit and then discharge until the floor and so on.

| Characteristic | Time (seconds) |
| --- | --- |
| Initial ramp up time | \ln(3) R_1 C |
| Discharge times | \displaystyle t_1 = \left ( \frac{R_1 R_2}{R_1 + R_2} \right ) C \ln{\left( \frac{R_2 - 2R_1}{2R_2-R_1} \right)} |
| Subsequent charge times | t_2 = \ln(2) R_1 C |
| Period | T = t_1 + t_2 |
| Frequency | \displaystyle f = \frac{1}{T} |
| Duty cycle | \displaystyle D = \frac{ t_2 }{ T} |
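As a rough check of the timing expressions, the sketch below evaluates the charge and discharge times and the resulting duty cycle for a pair of resistors. It assumes the output is high while the capacitor charges, so the duty cycle is taken as the charge time over the period; the function names and the guard on the resistor ratio are illustrative additions.

```python
import math

def astable_times(r1, r2, c):
    """Return (charge_time, discharge_time) in seconds."""
    if 2.0 * r2 >= r1:
        # The logarithm below is only defined when R2 < R1 / 2.
        raise ValueError("requires R2 < R1 / 2")
    charge = math.log(2) * r1 * c
    parallel = r1 * r2 / (r1 + r2)
    discharge = parallel * c * math.log((r2 - 2.0 * r1) / (2.0 * r2 - r1))
    return charge, discharge

def duty_cycle(r1, r2, c):
    """Fraction of each period spent charging (assumed to be output high)."""
    charge, discharge = astable_times(r1, r2, c)
    return charge / (charge + discharge)

# Resistor values from the text; the capacitance cancels out of the ratio.
duty = duty_cycle(4.7e3, 2.0e3, 100e-6)
print(f"duty cycle: {duty:.1%}")
```

Because both times scale linearly with C, the capacitor sets the period while the resistor pair alone sets the duty cycle.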


To calculate the values for R_1, R_2 \text{ and } C, I wrote a program to explore the combinations of standard resistor and capacitor values, then narrowed it down to those combinations that would give nearly the desired duty cycle and a period of about a few seconds. Values of 4.7 \text{ k} \Omega, 2 \text{ k} \Omega, \text{ and } 100 \text{ } \mu F for the variables gave a period of roughly 6.5 seconds and a duty cycle of 49.6\% providing the closest fit.

Pulse Width Modulation (PWM)
Center: PWM Circuit. Source: Adapted from Afrotechmods.com PWM Tutorial.
pwm


In order to control how fast the motor rotates, we’ll take advantage of the fact that the motor’s speed is linearly proportional to the voltage supplied to the motor and that the average output voltage of a PWM signal is linearly proportional to its duty cycle. To realize that plan, a 555 integrated circuit in astable mode is used to generate a high frequency signal whose duty cycle is controlled by the use of a potentiometer.

The fundamental operation of the 555 integrated circuit remains unchanged; however, the PWM circuit has a different topology than the timing circuit’s and, as a result, has different timing characteristics. When the circuit is initially charging, current will go through R_1, the bottom diode, R_2 and C_1 until the threshold voltage has been reached. Then, during the discharge phase, current will travel through the complementary side R_2^{\prime} of the potentiometer and the top diode to the discharge pin until the trigger voltage has been reached.

Using values of R_1 = 330 \Omega, R_2 = 0 - 100 \text{ k}\Omega and C_1 = 0.1 \mu \text{F} we find the following:

| Characteristic | Time (seconds) |
| --- | --- |
| Initial ramp up time | \ln(3) (R_1 + R_2) C_1 |
| Discharge times | t_1 = \ln(2) R_2^{\prime} C_1 |
| Subsequent charge times | t_2 = \ln(2) (R_1 + R_2) C_1 |


Thus, we’ll end up with a circuit running at 144 \text{ Hz} with a duty cycle range of 0.005\% - 99.7\% giving a very broad range of voltage values that can be chosen at run time.
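The operating frequency follows from summing the charge and discharge times. In this sketch the potentiometer position is modeled by a hypothetical wiper fraction between 0 and 1 that splits the 100 kΩ track into R_2 and its complement R_2^{\prime}; that parameterization is an assumption for illustration.

```python
import math

R1 = 330.0      # ohms
R_POT = 100e3   # ohms, full potentiometer track (R2 plus its complement)
C1 = 0.1e-6     # farads

def pwm_times(wiper_fraction):
    """Return (charge_time, discharge_time) for a wiper position in [0, 1]."""
    r2 = wiper_fraction * R_POT
    r2_complement = R_POT - r2
    charge = math.log(2) * (R1 + r2) * C1
    discharge = math.log(2) * r2_complement * C1
    return charge, discharge

def pwm_frequency(wiper_fraction):
    """Frequency in Hz, the reciprocal of one charge plus one discharge."""
    charge, discharge = pwm_times(wiper_fraction)
    return 1.0 / (charge + discharge)

for position in (0.0, 0.5, 1.0):
    print(f"wiper at {position:.1f}: {pwm_frequency(position):.1f} Hz")
```

Since R_2 and R_2^{\prime} always sum to the full track resistance, the period does not change as the potentiometer is turned; only the split between charge and discharge time moves.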

NOT-Gates and Resistor-Transistor Logic (RTL)
Center: RTL NOT-Gate using 2N3904 NPN Transistor
rtl-not-gate


To begin talking about the logic used to combine the PWM and timing signals, we’ll need to perform negation of a signal, A, into \bar{A}. The simplest way to do so is to use an RTL based NOT-Gate as depicted above. Assuming V_{cc} = 5\text{ V} is logical true, and zero volts is logical false, then when A = T we’ll switch the transistor on so that \bar{A} = F, and when A = F we’ll switch it off so that \bar{A} = T.

To determine the values of the resistors, we’ll need to look at the low-frequency, large signal model of the transistor which consists of three states: cut-off, active-linear and saturation. For the logic gate, we’ll want to minimize the active-linear region and focus on flip-flopping between the cut-off and saturation regions based on the value of A.

For the output to be logically true, V_{cc} = 5 \text{ V}, the input voltage must be less than or equal to the saturation constant between the base and emitter, V_L = V_{\text{BE(SAT)}} = 0.65 \text{ V}. This condition will put the switch into a cut-off state.

If however, the input voltage is greater than V_{\text{BE(SAT)}}, then the transistor will be in saturation mode and from Kirchhoff’s current laws we’ll see that the current for the base and collector are:

\displaystyle I_B = \frac{V_{\text{In}} - V_{\text{BE(SAT)}}}{R_1} \quad I_C = \frac{V_{cc} - V_{\text{CE(SAT)}}}{R_2}.


Based on the condition that the ratio of the collector current and base current be less than the gain, I_C < I_B \beta, we’ll find that

\displaystyle \underbrace{ \frac{1}{\beta}\frac{R_1}{R_2} \left ( V_{cc} - V_{\text{CE(SAT)}} \right )  + V_{\text{BE(SAT)}}}_{V_H} < V_{\text{In}}


Center: Transfer characteristics of the NOT-Gate. Resistor values will be picked so as to minimize the active linear region.
transistor-switch


Given the latest piece of information, it’s possible to decide on a value of one resistor and pick the value of the other such that the collector resistor is several times larger than the base resistor and greater than 25 \ \Omega to ensure that the max current, I_{\text{C(Max)}} = 200 \text{ mA}, into the transistor is not exceeded. Based on these criteria, I went with R_1 = 330\ \Omega, R_2 = 1 \text{ k} \Omega.
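To see how narrow the active-linear region ends up being with these resistor choices, the thresholds can be evaluated numerically. The supply voltage, V_{\text{BE(SAT)}}, resistor values and maximum collector current come from the text; the gain \beta = 100 and V_{\text{CE(SAT)}} = 0.2 \text{ V} are assumed ballpark figures for a small-signal NPN transistor and should be replaced with datasheet values.

```python
V_CC = 5.0        # volts, logic supply
V_BE_SAT = 0.65   # volts, base-emitter saturation voltage (from the text)
I_C_MAX = 0.200   # amps, maximum collector current (from the text)
R1 = 330.0        # ohms, base resistor
R2 = 1000.0       # ohms, collector resistor

BETA = 100.0      # ASSUMED current gain, not from the original post
V_CE_SAT = 0.2    # ASSUMED collector-emitter saturation voltage

# Inputs at or below V_L keep the transistor in cut-off.
v_low = V_BE_SAT

# Inputs above V_H saturate the transistor (I_C < beta * I_B).
v_high = (1.0 / BETA) * (R1 / R2) * (V_CC - V_CE_SAT) + V_BE_SAT

# Collector current in saturation and the smallest safe collector resistor.
collector_current = (V_CC - V_CE_SAT) / R2
min_collector_resistance = V_CC / I_C_MAX

print(f"V_L = {v_low:.3f} V, V_H = {v_high:.3f} V")
print(f"I_C = {collector_current * 1000:.1f} mA, R2 must exceed "
      f"{min_collector_resistance:.0f} ohms")
```

Under these assumptions the two thresholds sit within a few tens of millivolts of each other, and the collector current is far below the transistor’s limit.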

Logic
Center: Logical operations performed on the PWM and timing signals for the left and right output signals.
logic-output


Given the timing signal and the PWM signal, we’ll produce a composite signal that will retain the period of the timing signal while controlling the observed amplitude from the PWM signal. This will give a signal that can be used to control one of the terminals of the gearmotor. Since the gearmotor terminals are designed to have one be in a ground state while the other in a high state, there will be a second composite signal that is simply the composite of the negation of the timing signal with the PWM signal.
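The composite signals described above amount to two small Boolean expressions. The sketch below spells them out as a truth table; which output is called "left" and which "right" is an arbitrary naming assumption.

```python
def composite_signals(timing, pwm):
    """Return (left, right) drive signals for the two gearmotor terminals.

    left  = timing AND pwm
    right = (NOT timing) AND pwm
    """
    left = timing and pwm
    right = (not timing) and pwm
    return left, right

# Print the full truth table.
for timing in (False, True):
    for pwm in (False, True):
        left, right = composite_signals(timing, pwm)
        print(f"timing={timing!s:5} pwm={pwm!s:5} -> left={left!s:5} right={right!s:5}")
```

The two outputs are never high at the same time, which is what allows one terminal to sit at ground while the other is pulsed.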

Center: Completed logic circuit.
logic


Since there were multiple signals to conjoin, I opted to use the 7408 quad, two-input AND-gate integrated circuit based on Transistor-transistor logic (TTL), a more efficient way of approaching the problem. I could have just as well used RTL to perform the conjunction, but the protoboard real estate required exceeded what I’d allocated, and the integrated circuit simplified the design. There were NOT-gate integrated circuits that I could have used as well (e.g., 7404), but I decided to use the RTL based solution since it gave me an opportunity to learn the basics of transistors which would be required to understand the motor driver.

Driving the Motor
Center: Motor driver circuit. Source: Adapted from Texas Instruments SN754410 Datasheet Page 6, Figure 3.
h-bridge


With the motor controller specified, it’s time to look at the motor driver. The gearmotor will draw a fair amount of current and since most of the logic circuitry is designed for low current consumption, it’s not feasible to drive the motor using the controller. Instead, an H-Bridge will be used to supply enough current while isolating the controlling logic.

An H-Bridge consists of several transistor switches and flyback diodes for controlling the flow of current. The motor is designed to rotate clockwise or counterclockwise depending on the polarity of the voltage applied to the terminals. Since we’ve got two signals that designate which terminal is ground and which is high, it’s a matter of feeding the signals to the inputs of the bridge that correspond to rotating in the desired direction.

I’d put together two designs, one using bipolar junction transistors (BJT) and one using the SN754410 quadruple half-H-bridge integrated circuit. The first required a handful of components and, wanting to better understand transistors, I opted to go this route for prototyping. In creating the production protoboard I decided to go with the SN754410 for reasons I’ll cover in that section. As far as the design is concerned, they are functionally identical for exposing a higher voltage source to a motor while insulating the lower voltage controller circuitry.

Completed Motor Controller
Center: Motor controller protoboard circuit. Wires between nodes are represented as lines with arrows and traces are solid lines. Primary output (gold), intermediate results (blue), ground (black), voltage high (red).
soldered---motor-control-pc


With a full list of schematics for the motor controller, the next step is to design the circuit that will be soldered to the controller’s protoboard. The motor controller will sit in one of the legs of the product and reside on a 2 \times 8 \text{ cm} protoboard. With the limited real estate, it is necessary to utilize each position on the protoboard. To do this, the PWM and timing circuits occupy the right-hand side of the board, while the logic circuitry occupies the left side. Two block terminals are used to route input and output signals for the logic voltage and ground, and the two output signals. The motor driver itself will reside along with the circuitry for the LEDs which will be covered in the next section.

LED Control

Overall, the product consists of a single red LED for the eye, three blue LEDs for the chest and three green indicator LEDs used for debugging the circuit. The eye LED is always on, supplied with a constant 6 \text{ V} whenever the product is on. The three chest LEDs are driven by a triangle oscillator and each of the indicator lights by its respective signal. One complication in designing the indicators is that the ideal approach of driving them with 6 \text{ V} and switching them on and off with a transistor switch controlled by their respective signals consumes a lot of protoboard real estate. As a compromise, the indicators are driven directly by their signals.

Since the triangle oscillator represents the bulk of the circuit, this section will be dedicated to its analysis, operation and characteristics.

Voltage Divider

voltage-divider

Since we’ll be using a single supply operational amplifier design to control the chest LEDs, we’ll need to create a reference voltage between the supply and ground. This will be achieved using a voltage divider. By Kirchhoff’s voltage law we have V_{cc} = I (R_1 + R_2), which means that the voltage drop across the first resistor is V_1 = \frac{R_1}{R_1 + R_2} V_{cc} and the drop across the second resistor to ground is V_2 = \frac{R_2}{R_1 + R_2} V_{cc}. Since we want a reference voltage half that of the supply, we choose R_1 = R_2 so that V_1 = V_2 = \frac{1}{2}V_{cc}.
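As a quick numeric check of the divider relations, here is a minimal Python sketch. The 5 V supply and the 10 kΩ resistor values are illustrative assumptions, not values taken from the schematic.

```python
def divider_voltages(v_cc, r1, r2):
    """Return (v1, v2): the drops across R1 (supply side) and R2 (ground side)."""
    total = r1 + r2
    return v_cc * r1 / total, v_cc * r2 / total


# Hypothetical values: 5 V supply, two equal 10 kOhm resistors.
v1, v2 = divider_voltages(5.0, 10e3, 10e3)
print(f"V1 = {v1} V, V2 = {v2} V")
```

With equal resistors both drops are half the supply, and for any resistor pair the two drops sum back to the supply voltage.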

Non-inverting Schmitt Trigger
Center: Voltage transfer characteristic of the non-inverting Schmitt Trigger which exhibits a hysteresis effect.
schmitt-trigger-transfer-ch

In the timing section we focused on creating a square wave output using a 555 timer. While this is one way to go about it, another is to use an operational amplifier based non-inverting Schmitt Trigger. The idea is that for a given input voltage, V_{\text{In}}, the output, V_{\text{Out}}, will be either the operational amplifier’s rail low, V_L, or rail high, V_H, voltage depending on whether the input voltage is decreasing past the transition to lower threshold, V_{TL}, or increasing past the transition to higher threshold, V_{TH}.

Center: Non-inverting Schmitt Trigger with reference voltage.
non-inverting-schmitt-trigg

In this configuration, the operational amplifier is used in a positive feedback loop, in which case there are two defining characteristics of its ideal behavior: (1) the output voltage of the amplifier, V_{\text{Out}} = A(V_{+} - V_{-}), is linearly proportional to the difference between the two terminals on the order of the amplifier’s gain, A, and (2) there is no current flowing into either of the terminals, I_{+} = I_{-} = 0. Based on these assumptions, we’ll apply Kirchhoff’s current law to the non-inverting terminal of the amplifier to determine the appropriate values for V_{TL} and V_{TH}.

\displaystyle I_{+} = I_{1} + I_{2} \implies 0 = \frac{V_{+} - V_{\text{In}}}{R_1} + \frac{V_{+} - V_{\text{Out}}}{R_2}

There are two cases to explore, in the first, we’ll assume V_{\text{Out}} = V_{H}. For that to be true, V_{+} \ge V_{\text{Ref}} due to characteristic (1). In the second, we’ll assume that V_{\text{Out}} = V_{L}, which means V_{+} \le V_{\text{Ref}} for the same reasons. Based on these two different sets of assumptions we find the following relationships.

\displaystyle V_{\text{TL}} = \left(\frac{R_1}{R_2} + 1 \right)V_{\text{Ref}} - V_H \left(\frac{R_1}{R_2}\right) \quad V_{\text{TH}} = \left(\frac{R_1}{R_2} + 1 \right)V_{\text{Ref}} - V_L \left(\frac{R_1}{R_2}\right)

On its own, the circuit won’t generate a square wave, but as the input varies with time, the circuit will flip-flop between ground and the supply voltage. The timing of the transitions will be determined in part by the circuit’s noise immunity, i.e., the difference between the thresholds.
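The two thresholds follow from setting V_{+} = V_{\text{Ref}} in the current-law equation and solving for V_{\text{In}}. A minimal Python sketch of that calculation is below; the resistor and rail values are hypothetical, and the amplifier is assumed ideal with outputs that reach the rails exactly.

```python
def schmitt_thresholds(v_ref, v_low, v_high, r1, r2):
    """Return (v_tl, v_th) for an ideal non-inverting Schmitt trigger.

    v_tl: input level below which a high output switches low.
    v_th: input level above which a low output switches high.
    """
    ratio = r1 / r2
    v_tl = (ratio + 1.0) * v_ref - v_high * ratio
    v_th = (ratio + 1.0) * v_ref - v_low * ratio
    return v_tl, v_th


# Hypothetical values: 5 V rail, 2.5 V reference, R1 = 10 kOhm, R2 = 20 kOhm.
v_tl, v_th = schmitt_thresholds(2.5, 0.0, 5.0, 10e3, 20e3)
print(f"V_TL = {v_tl} V, V_TH = {v_th} V, hysteresis = {v_th - v_tl} V")
```

The width of the hysteresis window, V_{TH} - V_{TL} = \frac{R_1}{R_2}(V_H - V_L), depends only on the resistor ratio and the rail voltages.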

Inverting Integrator
Center: Inverting integrator with reference voltage.
inverting-integrator

To complete the triangle oscillator, we’ll need to review the inverting integrator based on an operational amplifier. The idea behind the circuit is that as the input voltage varies with time, the output voltage will be the negated accumulation of that input.

The inverting integrator is an example of a negative feedback loop. The characteristics that applied to analyzing a positive feedback loop also apply to analyzing a negative feedback loop with the additional characteristic that (3) the voltage of the two terminals is identical, V_{+} = V_{-}.

Par for the course, we’ll start by applying Kirchhoff’s current law to the inverting terminal of the operational amplifier.

\displaystyle I_{-} = I_R + I_C \implies 0 = \frac{V_{\text{In}} - V_{-}}{R} + C \frac{d}{dt} \left ( V_{\text{Out}} - V_{-} \right )
\displaystyle \implies \int_{0}^{t} \frac{d}{dt} \left ( V_{\text{Out}}(t) - V_{\text{Ref}} \right ) \, dt = -\frac{1}{RC}\int_{0}^{t} \left ( V_{\text{In}}(t) - V_{\text{Ref}} \right ) \, dt
\displaystyle \implies V_{\text{Out}}(t) = -\frac{1}{RC}\int_{0}^{t} V_{\text{In}}(t) \, dt + \left(\frac{t}{RC} + 1 \right ) V_{\text{Ref}}


Let’s assume that the input voltage can only take one of two values, V_{\text{In}} = V_{cc} or V_{\text{In}} = 0, and that the reference voltage is half the maximum of these two voltages, V_{\text{Ref}} = \frac{1}{2} V_{cc}. Based on these assumptions the observed output voltage is then:

\displaystyle V_{\text{Out}}(t) =  \frac{1}{2}\left(1 \pm \frac{t}{RC} \right ) V_{cc}

When V_{\text{In}} = V_{cc} the output will decrease linearly and when V_{\text{In}} = 0 the output will increase linearly. If we uniformly toggle back and forth between these two values, then the output voltage will produce a triangle wave.

Triangle Oscillator
Center: Schmitt Trigger input (black) and output (blue).
triangle-oscillator-output

Now that the Non-inverting Schmitt Trigger and Inverting Integrator have been covered, it’s time to loop the two together so that the trigger’s output feeds into the integrator’s input and the integrator’s output feeds into the trigger’s input. Assuming that the trigger’s output is in the high state when the circuit is started, the trigger’s input (and hence the integrator’s output) has to be greater than the trigger’s upper threshold. (The complementary set of events would take place if we had instead assumed that the trigger was originally outputting a low state.)

As the trigger’s output continues to be the high state, the inverting integrator’s output will linearly decrease over time. This output will continue to be fed back into the trigger until the trigger’s lower threshold is surpassed. Once this happens, the trigger’s output will be the low state. As the trigger’s output continues to be in the low state, the integrator’s output will linearly increase over time. This output will continue to be fed back into the trigger until the trigger’s upper threshold is surpassed. The trigger’s output will then be the high state and the whole cycle will repeat itself.
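The cycle described above can be sketched numerically. The following Python snippet is a simple forward-Euler simulation of an ideal trigger feeding an ideal integrator; it is not a model of the actual LM358 circuit, and every component value and name in it is an illustrative assumption.

```python
def simulate_triangle_oscillator(v_cc=5.0, r1=10e3, r2=20e3, r=100e3, c=10e-6,
                                 dt=1e-4, t_end=5.0):
    """Return a list of (time, trigger_output, integrator_output) samples."""
    v_ref = v_cc / 2.0
    ratio = r1 / r2
    v_tl = (ratio + 1.0) * v_ref - v_cc * ratio  # threshold for switching low
    v_th = (ratio + 1.0) * v_ref                 # threshold for switching high (V_L = 0)

    trigger = v_cc   # assume the trigger starts in the high state
    v_int = v_th     # so the integrator output starts at the upper threshold
    samples = []
    for step in range(round(t_end / dt)):
        # Inverting integrator: dV_out/dt = -(V_in - V_ref) / (R C)
        v_int -= (trigger - v_ref) / (r * c) * dt
        # Non-inverting Schmitt trigger with hysteresis
        if trigger == v_cc and v_int <= v_tl:
            trigger = 0.0
        elif trigger == 0.0 and v_int >= v_th:
            trigger = v_cc
        samples.append((step * dt, trigger, v_int))
    return samples


samples = simulate_triangle_oscillator()
ramp = [v for _, _, v in samples]
print(f"integrator output ranges from {min(ramp):.3f} V to {max(ramp):.3f} V")
```

The integrator output ramps back and forth between the two thresholds while the trigger output toggles between the rails, which is the square wave and triangle wave pair described above.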

Center: Triangle Oscillator circuit using an LM358 dual operational amplifier. Source: Adapted from Op Amps for Everyone Page A-44 Figure A-44.
triangle

The completed circuit consists of a voltage divider, trigger and integrator built around an LM358 dual operational amplifier. To provide a wide range of frequencies, I opted to use a potentiometer to control the frequency of the output rather than relying on a single fixed value resistor. I didn’t know what the ideal frequency would be, and the potentiometer bought me a range of solutions rather than just one.

Based on the equations derived, the frequency of the system is:

\displaystyle f = \frac{R_F}{4 C R_1 R_2}

Using values of R_1 = R_2 = 20 \text{ k}\Omega, R_F = 0-100 \text{ k}\Omega and C = 10 \ \mu \text{F} buys a frequency range of 0-6.25 \text{ Hz}. For the purpose of flashing a series of LEDs, this is sufficient. As far as the extrema of the output voltage are concerned, we are looking at:

\displaystyle V_{\text{Out}} = \frac{1}{2} \left( 1 \pm \frac{R_2}{R_F} \right) V_{cc}

The design will use blue LEDs which come with a voltage drop of about 3 \text{ V} meaning that for a value of R_F = 20 \text{ k}\Omega we should expect fairly triangular output in the lights, but as that value increases, and the resulting output window narrows, we’ll see only blips of light fade in, then out followed by a period of darkness before cycling.
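To make the two formulas concrete, here is a small Python sketch that evaluates them for the component values quoted above. The 6 V supply is an assumption made for illustration, and the function names are hypothetical.

```python
def oscillator_frequency(r_f, r1, r2, c):
    """Frequency in Hz from f = R_F / (4 C R_1 R_2)."""
    return r_f / (4.0 * c * r1 * r2)


def output_extrema(v_cc, r2, r_f):
    """Return (minimum, maximum) of V_Out = (1/2)(1 +/- R_2 / R_F) V_cc.

    r_f must be greater than zero; the formula is undefined at R_F = 0.
    """
    swing = r2 / r_f
    return 0.5 * (1.0 - swing) * v_cc, 0.5 * (1.0 + swing) * v_cc


# Component values quoted in the text; the 6 V supply is an assumption.
print(oscillator_frequency(100e3, 20e3, 20e3, 10e-6))  # top of the frequency range
print(output_extrema(6.0, 20e3, 20e3))                 # R_F = 20 kOhm
print(output_extrema(6.0, 20e3, 100e3))                # R_F = 100 kOhm
```

Under that assumed supply, the output window shrinks from the full supply range at R_F = 20 \text{ k}\Omega to a band of a little over a volt around the midpoint at R_F = 100 \text{ k}\Omega, which matches the narrowing described above.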

Completed LED Controller
Center: Triangle oscillator, motor driver and LEDs protoboard circuit. Primary output (gold), intermediate results (blue), ground (black), logic voltage high (red), motor voltage high (purple).
soldered---main-pcb-schemat

All of the LEDs in the product sit in the chest cavity of the body and reside on a 5 \times 7 \text{ cm} protoboard. While there are numerous positions on the board, most of the components were required to be in specific locations and took up a fair amount of space. The triangle oscillator took up the left-hand side of the circuit with the motor driver taking up the right-hand side. The top of the circuit consisted of the chest and eye LEDs and the bottom of the circuit had input and output block terminals for taking in logic and driving voltage as well as ground, and the two motor control signals. The output block terminals are then connected to the motor.

Power Control

In designing the electronics for managing the power in the product, I chose to provide two separate voltage sources: two AAA batteries giving a combined 3 \text{ V} and a single 9 \text{ V} battery. The former is used to power a boost converter up to 5 \text{ V} for the purpose of powering all of the logical circuits in the product; the latter is used to power a 6 \text{ V} linear regulator for the purpose of powering the LEDs and motor. Both voltage sources are controlled by a single latching push button.

Of the subcomponents in the circuit, the boost converter is the most interesting; this section will be primarily devoted to discussing its analysis, operation and characteristics.

Series-Parallel Resistor-Inductor-Capacitor (RLC) Circuits

To begin talking about the boost converter, it’s necessary to talk about the series-parallel RLC circuit which differs from the standard series and parallel circuit topologies in that the inductor runs in series to a resistor and capacitor in parallel.

Center: Series-parallel RLC circuit.
series-parallel-rlc

While the topologies may differ, the characteristics used to simplify the analysis of RLC circuits remain the same. When convenient, the following substitutions will be made:

Characteristic | Equation
Natural frequency | \omega_0 = \sqrt{\frac{1}{LC}}
Damping attenuation | \alpha = \frac{1}{2RC}
Damping factor | \zeta = \frac{\alpha}{\omega_0}
Damped natural frequency | \omega_d = \sqrt{\alpha^2 - \omega_0^2}

Based on Kirchhoff’s voltage law the voltage source is the sum of the voltage across the inductor and that of the RC subcircuit. Kirchhoff’s current law says that the amount of current flowing through the inductor is the same as the aggregate current flowing through the capacitor and resistor. Based on these assumptions, we arrive at the following second order linear nonhomogeneous ordinary differential equation:

\displaystyle V_{cc} \omega_0^2 = \frac{d^2}{dt^2} V + 2 \alpha \frac{d}{dt} V + \omega_0^2 V
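As a sketch of the intermediate steps, let I be the current through the inductor and V the voltage across the parallel RC subcircuit (symbols chosen here for illustration). The voltage and current laws then read:

\displaystyle V_{cc} = L \frac{d}{dt} I + V \qquad I = C \frac{d}{dt} V + \frac{V}{R}

Substituting the second relation into the first gives:

\displaystyle V_{cc} = LC \frac{d^2}{dt^2} V + \frac{L}{R} \frac{d}{dt} V + V

Dividing through by LC and substituting \omega_0^2 = \frac{1}{LC} and 2 \alpha = \frac{1}{RC} yields the equation above.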

The general solution for a differential equation of this form is to take the superposition of the homogeneous solution with the particular (nonhomogeneous) solution. For the former we’ll use the characteristic equation of the homogeneous equation and then the method of undetermined coefficients for the latter.

\displaystyle \lambda^2 + 2 \alpha \lambda + \omega_0^2 = 0 \implies \lambda = - \alpha \pm \sqrt{\alpha^2 - \omega_0^2} = - \alpha \pm \omega_0 \sqrt{\zeta^2 - 1} = -\alpha \pm \omega_d

Since we have a second order equation, solving the characteristic equation can yield repeated real roots (critically damped, \zeta = 1), distinct real roots (overdamped, \zeta > 1) or complex conjugate roots (underdamped, \zeta < 1). Since the end goal is to explore the boost converter, only the underdamped case will be reviewed. In that case \alpha < \omega_0, the roots are \lambda = -\alpha \pm i \sqrt{\omega_0^2 - \alpha^2}, and \omega_d in the solution below denotes the real oscillation frequency \sqrt{\omega_0^2 - \alpha^2}. For the particular solution, V_p, we have a constant forcing function, g(t) = V_{cc} \omega_0^2, so the method of undetermined coefficients says we’ll end up with a constant valued particular solution. Taking these results together we arrive at:

\displaystyle V = e^{-\alpha t} \left( c_0 \cos(\omega_d t) + c_1\sin(\omega_d t) \right) + c_2


In order to determine the coefficients’ values, we’ll need to determine the initial conditions of the system. Initially, the inductor will resist any change in current, so since there is no current before the circuit is switched on, the initial current is zero, I(0) = 0. If no current is flowing, then the inductor briefly acts like an open switch and the output voltage is zero, V(0) = 0. Looking at the limiting behavior of the circuit, the inductor will turn into a plain connection and the capacitor will become fully charged and effectively disappear from the circuit; this means that the steady state voltage will become the source voltage, c_2 = V_{cc}. The initial conditions then fix c_0 = -V_{cc} and c_1 = -\frac{\alpha}{\omega_d} V_{cc}. As a result we end up with the following solution:

\displaystyle V = V_{cc} \left( 1 - e^{-\alpha t} \left( \cos(\omega_d t) + \frac{\alpha}{\omega_d} \sin(\omega_d t) \right) \right)


Center: Output of an underdamped series-parallel RLC circuit with a low damped natural frequency.
series-parallel-rlc-voltage


Looking at the voltage over time we see that the output voltage is greater than the input voltage at the beginning of the circuit’s uptime and as time elapses, the output voltage converges to the input voltage. The peak output voltage will be observed very early on at t_{\text{Max}} = \frac{\pi}{\omega_d}:

\displaystyle V_{\text{Max}} = V_{cc} \left ( 1 + \exp{ \left(-\frac{\alpha \pi}{\omega_d} \right ) } \right )


We’ll leverage this behavior to get the boost in voltage from the boost converter.
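The overshoot can be checked numerically with a short Python sketch. It evaluates the underdamped solution, taking \omega_d as the real quantity \sqrt{\omega_0^2 - \alpha^2}, and evaluates the peak at t = \frac{\pi}{\omega_d}, where the cosine equals -1 and the sine term vanishes. The 1 kΩ load is a hypothetical value, the inductance and capacitance are borrowed from the datasheet values quoted later in the article, and the function names are made up for illustration.

```python
import math


def damping_parameters(r, l, c):
    """Return (alpha, omega_d) for an underdamped series-parallel RLC circuit."""
    alpha = 1.0 / (2.0 * r * c)
    omega_0 = 1.0 / math.sqrt(l * c)
    if alpha >= omega_0:
        raise ValueError("circuit is not underdamped")
    return alpha, math.sqrt(omega_0 ** 2 - alpha ** 2)


def step_response(t, v_cc, r, l, c):
    """Voltage across the RC subcircuit at time t after the source is applied."""
    alpha, omega_d = damping_parameters(r, l, c)
    oscillation = math.cos(omega_d * t) + alpha / omega_d * math.sin(omega_d * t)
    return v_cc * (1.0 - math.exp(-alpha * t) * oscillation)


def peak(v_cc, r, l, c):
    """Return (t_max, v_max): the time and value of the first (largest) peak."""
    alpha, omega_d = damping_parameters(r, l, c)
    t_max = math.pi / omega_d
    return t_max, v_cc * (1.0 + math.exp(-alpha * t_max))


# Hypothetical example: 3 V source, 1 kOhm load, L = 470 uH, C = 470 uF.
t_max, v_max = peak(3.0, 1e3, 470e-6, 470e-6)
print(f"peak of {v_max:.3f} V at t = {t_max * 1e3:.3f} ms")
```

For a lightly damped circuit the exponential factor is close to one, so the peak approaches twice the source voltage.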

DC-DC Boost Converter
Center: DC-DC Switching Boost Converter.
boost-converter


The boost converter is a way of converting an input voltage to a higher output voltage by switching between two different states that charge and discharge the inductor’s magnetic field. In the charging state, S_1 is switched closed and S_2 is switched open for a period of time t_{\text{On}}, resulting in two isolated circuits: a single voltage supply and inductor circuit, and an RC circuit. The RC circuit will discharge to produce a decreasing output voltage. In the discharge state, S_1 is switched open and S_2 is switched closed for a period of time t_{\text{Off}}, resulting in a series-parallel RLC circuit that produces an increasing output voltage. The output voltage is therefore the average of the voltage over switching between the two states.

Center: Simplified boost converter output.
boost-converter-output


For a thorough analysis of the boost converter, refer to Wens and Steyaert, to whom the following input-output voltage relationship is attributed.

\displaystyle V_{\text{Out}} = V_{\text{In}} \frac{1}{1 - \delta} \quad \delta = \frac{t_{\text{On}}}{t_{\text{On}} + t_{\text{Off}}}


As the duty cycle \delta increases from zero to one, the output voltage will start off as the input voltage and increase towards infinity. Realistically though, this is not obtainable and a 5x multiple is a more reasonable upper bound.
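A short Python sketch of the ideal relationship makes the numbers concrete. It assumes a lossless converter, so real duty cycles will differ, and the function names are hypothetical.

```python
def boost_output(v_in, duty):
    """Ideal output voltage V_Out = V_In / (1 - delta) for 0 <= delta < 1."""
    if not 0.0 <= duty < 1.0:
        raise ValueError("duty cycle must satisfy 0 <= delta < 1")
    return v_in / (1.0 - duty)


def duty_cycle_for(v_in, v_out):
    """Duty cycle delta = 1 - V_In / V_Out needed for a given boost (v_out >= v_in)."""
    if v_out < v_in:
        raise ValueError("a boost converter cannot step the voltage down")
    return 1.0 - v_in / v_out


print(duty_cycle_for(3.0, 5.0))  # duty cycle for a 3 V to 5 V boost
print(duty_cycle_for(1.0, 5.0))  # duty cycle for a 5x multiple
```

Under this idealization a 3 \text{ V} to 5 \text{ V} boost needs a duty cycle of about 0.4, while the 5x multiple mentioned above already requires about 0.8.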

3V to 5V Boost Converter

To realize this boost converter design, I went with the Maxim MAX630 to serve as the first switch in the system and a 1N4148 diode to serve as the second switch. (The diode functions as a switch by only allowing current to move in one direction.) According to the Maxim datasheet, the MAX630 works by monitoring the voltage on VFB; when it is too low, the MAX630 switches its internal N-channel MOSFET on LX open and shut at a high frequency to put the system into the charging state. Once VFB is above the desired voltage, LX is left open to put the system into the discharging state. This cycle repeats until the system is powered off.

Center: Boost converter. Source: Adapted from Maxim 630 Datasheet Page 11 Figure 5.
booster


Due to the oscillatory nature of the charging phase used by the MAX630, the analysis that was performed for the series-parallel RLC circuit is cumbersome to use here to determine the appropriate values for the passive components. Fortunately, the MAX630’s datasheet had a schematic for a 3 \text{ V} to 5 \text{ V} boost converter utilizing an inductance of L = 470 \ \mu \text{H} and capacitance C = 470 \ \mu \text{F}. The voltage dividers on the left-hand side of the schematic are used for low battery detection and the voltage divider on the right-hand side provides the reference for the voltage comparison done by the VFB input. Based on the datasheet these values come out to be R_1 = 249 \text{ k} \Omega, R_2 = 499 \text{ k} \Omega, R_3 = 200 \text{ k} \Omega and R_4 = 540 \text{ k} \Omega.

Completed Power Controller
Center: Power management protoboard circuit. Primary output (gold), intermediate results (blue), ground (black), voltage high (red).
soldered---power-management


The power controller will sit in one of the legs of the body and reside on a 2 \times 8 \text{ cm} protoboard. The voltage regulator sits on the left-hand side of the circuit while the boost converter occupies the right-hand side. In between are the block terminals for taking in the on-off switch, grounds, 3 \text{ V} and 9 \text{ V} supplies. Above that block terminal is the output terminal providing 6 \text{ V}, 5 \text{ V} and ground.

Building Phase

Sourcing

Center: 3D printed enclosures, acrylic plates, protoboards, electronic, electromechanical and mechanical components.
parts-2


One thing that surprised me perhaps more than anything about this project was how difficult it was to find the right parts having the desired characteristics. Overall, I had orders with about a half dozen vendors from here in the United States and abroad.

3D Printing services were carried out by Shapeways, Inc. out of New York, New York. After receiving my package, I noticed a missing piece. After contacting their customer service they were able to resolve the matter and ship me a replacement part. Evidently since the missing piece was inside the main shell the operator didn’t see it on the reference, so it didn’t get shipped. The hiccup delayed me by about two weeks, but nonetheless, they made right by the mistake.

The acrylic plates used on the front of the product were sourced from TAP Plastics of Mountain View, California. Painting supplies, adhesives and additional finishing tools were acquired from McGuckin’s Hardware store of Boulder, Colorado.

Machine elements were received from McMaster-Carr of Elmhurst, Illinois. They had quite possibly the fastest order placement to shipping time I’ve ever seen. I’d love to see the system that powers that operation. Additional elements were acquired from various Amazon merchants and local big-brand hardware stores. The gearmotor was purchased from Sparkfun Electronics of Boulder, Colorado.

Electronic components were primarily received from Mouser Electronics of Mansfield, Texas. Their selection and speed of shipping were superb. Additional components were purchased from electronics store J. B. Saunder’s of Boulder, Colorado when I needed something quickly and from various Amazon merchants.

In terms of cost of these parts, buying in bulk and in single orders saves on per item cost and on shipping. Buying just the components needed would have ended up being more expensive than buying them in bulk. In effect, multiple versions of the product could be produced cheaper than just producing one.

Prototyping

Industrial Design

Left-to-right: Prototype reverse and front. Shaded regions represent volumes that would be knocked out in the final design.
prototype


To develop a sense of how the product would come together, it was helpful to construct a cardboard based version of the final product based off the measurements I’d put together during the design phase. This enabled me to understand proportions, and the working space for the electronic and mechanical components. It also helped remind me that I was working towards a well-defined end goal.

Mechanical

As in the previous section, I also put together some cardboard based prototypes of the drive system. This consisted of a couple cardboard gears, a mocked up motor and a couple straws. Not being very savvy when it comes to all things mechanical, it was helpful to see the parts in action before committing to anything.

Once I had purchased the machine elements I wanted to see how the shafts and everything would mesh together so I decided to make a wooden version of the motor carriage. My thinking here was that if it was easy enough to make I could skip having that component 3D printed to save on cost. After a few trips to the lumber store, some careful drilling and wood glue, the motor carriage was put together and I was able to verify that the axle and motor assemblies would mesh together and be capable of reliably holding everything in place.

From this exercise I concluded that it wasn’t worth the extra effort to really spend a lot of time on the wooden version. I simply didn’t have the right tools or workspace to get the kind of precision that would be needed to make everything run smoothly so I proceeded to think about what the 3D printed version would look like.

Electronics

Working with the electronics was a bit of a steep learning curve to traverse, but as time went on, it became easier to translate circuit diagrams to the breadboard. Coming from a software background, I put together the circuits in as modular a way as possible to facilitate testing of sub circuits in isolation. This made it significantly easier than attempting to debug issues with the circuit as a whole.

Production

Industrial Design

Left-to-right: Reverse and front of the 3D printed components consisting of back plate, body, arms and motor bracket.
3d-print


3D printing of the product was done by laser sintering. In this process a thin layer of EOS Polyamide 2200 powder is laid down and the cross section of the product is heated to bond the material together; a new layer is then added and the process repeated until the volume is rendered.

After a month of modeling the product in OpenSCAD, the resulting STL file was submitted to be printed. After ten days, the product was fabricated and delivered. As an observation, the end result had a look and feel similar to that of a sugar cube. Overall, the detail on the product came out crisp and only those openings whose diameter was less than 2mm ended up coming out slightly deformed on one side. The rest of the details came out well. The back plate logo and copyright text as well as the interior WEEE symbol, RIC symbol and copyright text came out crisp and legible despite having fine details.

From left-to-right: Primed body, first coat of paint and final coat of paint.
painting


From here, it was time to undertake the process of giving the exterior of the model an aluminum-looking finish. This was achieved by applying an aerosol primer for plastics and several coats of a metallic paint that formed a firm film of enamel that added some extra strength to the body. In between coats, the body was sanded down with finer and finer grit to remove any imperfections or inconsistencies introduced during the spray paint process. I decided to keep the striations from the printing process since they gave the finished product a more believable brushed metal look. I didn’t paint the interior since I didn’t want any minuscule metallic flakes from the paint to potentially interact with the electronics.

From left-to-right: Finished product reverse and front.
assembly


The acrylic for the eye and chest plate was cut by hand and seated into the body with an epoxy for binding acrylic to plastics. To give the chest plate the same look as the original illustration, several layers of electrical tape were placed on the back of the chest plate and the openings were cut out with an X-Acto knife. Each of the mechanical and electronic components that were part of the body was then secured with additional adhesives. To make sure the mechanical components lined up properly I threaded an aluminum shaft through the ball bearings and then glued each bearing to the body or bracket. Once the adhesive had dried, it was easy to slowly pull the shaft back out.

Mechanics

Center: Arms, bracket, motor and gear train in relation to the arms.
mechanics


The final production work on the mechanics dealt with securing the pinion gear to the motor shaft, motor to the chassis, bevel gears and machine elements to the shafts and finally securing the arms. One of the major complications in putting the mechanical system together was the fact that many of the parts came from different vendors and possessed a mixture of imperial and metric units. As a result, things were done in more of a roundabout way than I would have liked to realize the original design. C’est la vie.

Starting with the gears themselves, the bevel gear had a racetrack-shaped bore with an interior diameter of about 4 mm. The closest aluminum dowel that would fit was 5/64″ in diameter. That being a nonstandard imperial diameter, I went with a 3/16″ diameter rod since it was more prevalent among the hardware vendors. To compromise, the bevel gears were attached and centered to the 5/64″ dowel with adhesive and left to set. Once set, they were then placed and centered inside the 3/16″ rod and fixed with adhesive.

The pinion gear and the motor’s shaft both had a 3 mm diameter, but the shaft was D-shaped since it was intended to mate with an RC car wheel. (Coincidentally, the gears were part of a larger differential gear set intended for an RC car.) Despite the mismatch in shape, the pinion and motor shaft shared the same diameter, so it was easy to secure the pinion on the shaft with adhesive.

You’ll note that fewer shaft collars were used as a consequence of this complication which was primarily rooted in my choice of gears. I didn’t have many options when it came to gears, and I went with the best of my worst options since it was cheaper to purchase the gears as a set, than it was to go out and buy all the gears individually for far more than I was willing to pay. Nonetheless, everything came together within reason.

Electronics

Center: Protoboards fastened to back plate consisting of power management (top left), motor and LED control (right) and timing (bottom left).
electronics


Transcribing each portion of the prototype from the breadboard over to the protoboard was a challenge that lasted for a few months. I’d spent a fair amount of time and was becoming fatigued by the experience and had hit a low point in terms of morale and motivation. As a result, I made mistakes that I shouldn’t have been making and I recognized I needed to change what I was doing if I was going to finish the project. After taking a two week break and thinking I’d gotten things under control by double and triple checking my designs and taking my time to make sure I wasn’t putting parts in backward or offset or connecting parts together that shouldn’t be by accident, I ran into a major problem.

I could not identify a short in my original BJT based H-bridge design. After reviewing the protoboard layout, breadboard layout, the design, my reference material and datasheets, I was stumped. This went on for weeks and I realized that this was just something more involved to get right than I had led myself to believe and that I needed to move on. As a result, I compromised on the design and decided to use a quad half H-bridge integrated circuit in place of a BJT based H-bridge design.

I also concluded that I needed to change my approach to the power management for the project. I felt that drawing current from a single voltage supply for both the logic and drive wasn’t the right thing to do and that I needed to split these concerns into their own dedicated voltage supplies. Not wanting to just throw another 9 \text{ V} battery into the mix, I decided I would go with the boost converter off of a 3 \text{ V} supply in order to supply 5 \text{ V} to the logic components while leaving the existing 9 \text{ V} supply to be regulated down to 6 \text{ V} for the motor and LEDs.

In retrospect it wasn’t the right decision since it meant adding complexity to the end result. It also meant desoldering a lot of work and spending additional time and money on new parts and a new design. But at this point, I had committed to the change and proceeded. After receiving the new parts I went through another round of testing on the new designs on the breadboard and concluded that the changes would work and proceeded to solder the changes to the protoboards.

Ironically, the boost converter and quad half H-bridge integrated circuit were the easiest things to map to the protoboard, and any doubt that I could get the final electronics to work was gone. Despite the big change and the frustration, I felt that I had turned things around and was back on track.

Having finished the protoboards I fastened them to the back plate and made sure there was enough room in the body for everything to fit. I’d given myself some room between the machinery and the electronics, but not enough for the wires to lie in between. With some electrical tape I was able to bind the wires tightly and secure everything in its place and was finally ready to test drive the end result.

Evaluation Phase

Testing

Having put so much effort into the body of the product, I didn’t really have the heart to subject the 3D printed bodies to any serious stress tests. In handling the material and developing a feel for it, I didn’t develop an impression that it was overly fragile; it withstood several rounds of aggressive sanding, boring and drilling without fracturing or warping. For me, this was good enough for something that would ultimately find a home on my bookshelf.

To identify any problems with the machinery of the product, I oriented the arms of the robot into the various positions I talked about in the requirements section to see how it would behave. All the elevated orientations resulted in the arms swinging down rather than being driven by the machinery. I attribute this mainly to how the arms were fastened to the shaft: pinched between two shaft collars and padded from the shaft with some electrical tape. In one sense this was good, since holding the arms up would have put too much strain on the motor, but on the other hand it was disappointing. Despite this complication, I decided that I was ok with just having the arms hanging down and shuffling back and forth.

The second big part of the character of the product was the illumination of the eye and chest. The eye ended up having plenty of light while the chest merely flickered on and off. Changing the power supply design and dimensions of the body resulted in a reduced voltage and increased displacement resulting in the diminished output. Since things were already soldered, I chose to leave things as is.

Future Work

The end result here isn’t without flaws. Working through the project I recognized along the way that there were things I hadn’t done quite right or that just didn’t sit with me well. The following is a list of things I would try to keep in mind next time I take on a project like this.

As far as the 3D printing process went, I would go back and redo how I incorporated the latching mechanism for the motor chassis and back plate. The biggest problem here was that post fabrication modifications had to be made since I incorrectly understood the tapping process. Nothing ruins precision faster than making changes by hand.

I had overlooked the theory behind illumination and had instead focused more on intuition. In the future I would spend more time reading up on the literature covering the right amount of light to use based on what I wanted to illuminate and the different techniques that exist for providing different types of coverage. In retrospect, I think this would have given the end result a more polished look.

The mechanical work was complicated by the impedance mismatch between the imperial and metric standards of the parts involved. Part of this was poor planning on my part; part was difficulty finding the right parts at a hobbyist price point. Nonetheless, I’d like to continue to develop my understanding of mechanical systems and how they can be incorporated into electronically driven solutions.

I would have also incorporated some wiring management directly into the part so that it was less of a hassle to fit the back plate to the body with everything. I’d also switch to an existing cable management system instead of relying on screw terminals so that it was easier to snap things together and give the board a lower profile to save on space.

I’d like to explore printed circuit boards the next time I approach a project like this. My knowledge of circuit design going into this was limited, and it would have meant a lot of wasted time, material and money had I gone ahead and ordered PCBs this round. Given that I now have a working model to base future work on, I would like to explore this route in the future.

As far as the on-off functionality goes, next time I think I will use a series of relays to switch access to the voltage supplies whenever a momentary push button is engaged. I think this would lead to a cleaner separation of the two voltage supplies.

The timing circuit was complicated by the initial ramp-up time, which gave rise to a slow initial rotation until the threshold was reached to go into astable mode. In the future I’d like to come up with a way to keep that initial ramp-up from showing up in the output of the arms. Related to this, I’d like to be able to control the length of the timing pulses to swing between clockwise and counterclockwise rotations. In all likelihood, I’d use a microcontroller since it would give me the greatest range of flexibility.
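
As a rough illustration of what that microcontroller logic might look like, here is a minimal Python sketch. The function names, the millisecond units and the pulse lengths are hypothetical and are not taken from the actual build. The only point is that the direction becomes a function of elapsed time and two independently adjustable durations, so there is no analog ramp-up before the first swing.

```python
def rotation_direction(t_ms: int, cw_ms: int, ccw_ms: int) -> int:
    """Return +1 (clockwise) or -1 (counterclockwise) at elapsed time t_ms.

    cw_ms and ccw_ms are hypothetical, independently tunable pulse lengths.
    """
    if cw_ms <= 0 or ccw_ms <= 0:
        raise ValueError("pulse lengths must be positive")
    # Position within the repeating clockwise/counterclockwise cycle.
    phase = t_ms % (cw_ms + ccw_ms)
    return 1 if phase < cw_ms else -1


def schedule(total_ms: int, cw_ms: int, ccw_ms: int, step_ms: int = 100):
    """Sample the direction every step_ms milliseconds from t = 0."""
    return [(t, rotation_direction(t, cw_ms, ccw_ms))
            for t in range(0, total_ms, step_ms)]


if __name__ == "__main__":
    # Assumed example: 300 ms clockwise followed by 200 ms counterclockwise.
    for t, direction in schedule(1000, 300, 200):
        print(t, "CW" if direction == 1 else "CCW")
```

On real hardware the returned sign would be mapped onto whatever inputs the motor driver expects; that mapping is left out here because it depends on the driver and wiring.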

Conclusion

With the finished product sitting on my bookshelf and reflecting on this project, the seasons it encompassed and the ups and downs I worked through, I have developed a greater appreciation for mechatronics, the physical product design cycle and the work people put into everyday products.

Taking the time to make something tangible for a change presented me with a number of challenges that I hadn’t had to face before and that’s what I enjoy most about these kinds of projects. It’s really about developing a new set of tools, techniques and thinking that I can apply to problems that arise in my personal and professional work.

This project allowed me to explore a number of interesting concepts within the framework of a seemingly simple toy. Let’s iterate over the main bullet points:

  • Analog circuit design: complete analysis and use of passive components coupled with semiconductors, with my first real exposure to transistors, operational amplifiers and 555 timers.
  • Protoboard design, soldering, debugging and desoldering techniques.
  • Exposure to driving DC motors using various techniques.
  • Better understanding of hardware development and the product design process.
  • Learned about industrial design guidelines and techniques for making cost-effective products using 3D printed materials.
  • Use of the finite element method to perform stress analysis of a complex geometric object. (Finally had an excuse to learn tensors.)
  • Learned how to use an assortment of CAD, CAM and CAE software solutions.

Overall, the project produced a number of positive outcomes. As a stepping stone, this project has left me wanting to explore mechatronics more deeply and I’ve got a number of ideas brewing in mind that could lead to more advanced “toys” in the future. I feel confident that I can take the lessons learned from this experience and avoid pitfalls that I might encounter in more advanced projects of similar focus going forward. For now, those ideas will have to wait as I return to my world of code and numbers.

About the Author

Garrett Lewellen is a software developer working at a private start-up in the Denver Metro Area designing and developing SaaS-based systems. With eight years’ experience and formal education in computer science with emphasis in applied mathematics, his primary interests lie in the application of statistical models to problems that arise in general computing. When he’s not working on projects, he’s out exploring the Rocky Mountains and enjoying the great outdoors.

Copyright

“3D Printed Toy Robot” available under CC BY-NC-ND license. Copyright 2013 Garrett Lewellen. All rights reserved. Third-party trademarks property of their respective owners.
