# Antimatroid, The

thoughts on computer science, electronics, mathematics

## Distributed k-Means Clustering

Abstract
k-Means Clustering [10] is a fundamental algorithm in machine learning, and often the first approach a user will try when they want to discover the natural groupings in a collection of n-dimensional vectors. The algorithm iteratively picks cluster centers by assigning vectors to their closest cluster, then recalculates the centers based on the assigned vectors’ means. Here it is used as a toy algorithm to motivate the pilot study, design, implementation, and evaluation of a fault-tolerant distributed system with emphasis on efficient data transfer using remote direct memory access (RDMA), and a distributed synchronization barrier that accommodates transient workers. Experiments will evaluate the performance of these interests and quantify how well the system compares to Apache Spark before concluding with a discussion on transforming this pilot study into a high-performance, cross-platform, fault-tolerant distributed system for machine learning.

### Introduction

The applications of k-Means clustering are numerous in business, engineering, and science where demands for prompt insights and support for increasingly large volumes of data motivate the need for a distributed system that exploits the inherent parallelism of the algorithm on today’s emerging hardware accelerators (e.g., FPGA [3], GPU [15], many integrated core, multi-core [9]).

There are however a number questions that arise when building such a system: what accelerators should be supported, how to partition data in an unbiased way, how to distribute those partitions, what each participant will calculate from their partition, how those individual results will be aggregated, and finally, how to synchronize tasks between participants to complete the job as a whole.

Two of these questions will be the focus of this work: how to efficiently transfer data to workers, and how to synchronize them in the presence of transient workers. For the former, remote direct memory access (RDMA) is used to distribute disk-based data from a coordinator to workers, and an extension to synchronization barriers is developed to allow workers to leave ongoing calculations, and return without interrupting other workers.

How well the system solves these issues will be measured by observed transfer rates and per iteration synchronization times. To understand the system’s scalability and place in the broader distributed machine learning landscape, its runtime performance will be evaluated against Apache Spark. Based on these results, future work will be discussed on how to move forward with the study to create a system that can meet the ever growing demands of users.

### Background

$1: \textbf{procedure } \textsc{k-Means}(C, X, d, N, K) \newline 2: \qquad \textbf{while } \text{not convered } \textbf{do } \newline 3: \qquad \qquad S_k \gets \{ x : k = \text{argmin}_{i} \lVert c_i - x \rVert_2, x \in X \} \newline 4: \qquad \qquad \Sigma_{k} \gets \sum_{x \in S_k} x \newline 5: \qquad \qquad \kappa_{k} \gets \lVert S_k \rVert \newline 6: \qquad \qquad c_{k} = \Sigma_{k} / \kappa_{k} \newline 7: \qquad \textbf{end while} \newline 8: \qquad \textbf{return } C \newline 9: \textbf{end procedure}$

k-Means clustering belongs to the family of expectation-maximization algorithms where starting from an initial guess $C$, $K$ centroids are iteratively inferred from a dataset $X$ containing $N$ $\mathbb{R}^d$ vectors. Each iteration the partial sum $\Sigma_k$, and quantity $\kappa_k$ of vectors nearest the $k^{th}$ centroid are aggregated. From these quantities the updated centroids $c_k$ can be calculated for the next iteration until the difference in magnitude from the former is less than some tolerance $\epsilon$ or a maximum number of iterations $I$ is observed providing a $\mathcal{O}(dIKN)$ linear time algorithm.

Parallelism is based on each participant $p$ being assigned a partition of the data so that instead of computing $\Sigma_k$ and $\kappa_k$ on the entirety of $X$, it is on some disjoint subset $X_p$ s.t. $X = \cup_p X_p$ and $\cap_p X_p = \emptyset$. When each participant has computed their $(\Sigma_k^p, \kappa_k^p)$, pairs the resulting set of values can be aggregated $(\Sigma_k, \kappa_k) = \sum_{p} (\Sigma_k^p, \kappa_k^p)$ to yield the updated centroid values $c_k$ for the next iteration.

To ensure that participants can formulate a unified view of the data, it is assumed that user supplied datasets are adequately shuffled. If the natural clusters of the data were organized in a biased way, participants would individually draw different conclusions about the centroids leading to divergence. Further, it will be assumed that a loss of an unbiased partition will not degrade the quality of results beyond an acceptable threshold under certain conditions (cf. [6] for a rich formalism based on coresets to support this assumption).

#### RDMA

One of the failings of conventional network programming is the need to copy memory between user and kernel space buffers to transmit data. This is an unnecessary tax that software solutions shouldn’t have to pay with runtime. The high performance computing community addressed this issue with remote direct memory access [12] whereby specialized network hardware directly reads from, and writes to pinned user space memory on the systems. These zero-copy transfers free the CPU to focus on core domain calculations and reduces the latency of exchanging information between systems over 56 Gbps InfiniBand or 40 Gbps RoCE.

#### Barriers

Synchronization barriers allow a number of participants to rendezvous at an execution point before proceeding as a group (cf. [12] for a survey of advanced shared-memory techniques). Multi-core kernels in the system use a counting (centralized) barrier: the first $n-1$ threads that approach the barrier will enter a critical section, decrement a counter, and sleep on a condition variable; the last thread to enter the critical section will reset the counter and signal the condition variable to wake the other threads before leaving the critical section. At this point all threads are synchronized and can proceed as a group.

Since the emphasis of this work is on fault tolerance, distributed all-to-all and one-to-all message passing versions of the counting barriers are considered (cf. [4] for more advanced distributed methods). In the former, every participant will notify and wait to hear from every other participant before proceeding. In the latter, every participant will notify and wait to hear from a coordinator before proceeding. The all-to-all design most closely resembles a counting barrier on each participant, whereas the all-to-one resembles a counting barrier on a single participant.

### Design

System participants consist of a single coordinator and multiple workers. A coordinator has the same responsibilities as a worker, but is also responsible for initiating a task and coordinating recovery. Workers are responsible for processing a task and exchanging notifications. A task is a collection of algorithm parameters (maximum iteration, convergence tolerance, number of clusters $K$), initial guesses ($K$ by $d$-dimensions), and unbiased partition of the dataset ($N$ by $d$-dimensions) that a participant is responsible for computing. Notifications consist of identifying information (host name, port), the current iteration, and partial results ($K$ counts and $K$ by $d$-dimensional sums).

#### Dataset Transfer

Figure 1: Example of transferring data between hosts using Accelio RDMA.

The first responsibility of the coordinator is to schedule portions of the disk based dataset to each worker. The coordinator will consult a list of participants and verify that the system can load the contents of the dataset into its collective memory. If so, the coordinator schedules work uniformly on each machine in the form of a task. A more sophisticated technique could be used, but a uniform loading will minimize how long each worker will wait on the others each iteration (assuming worker’s computing abilities are indistinguishable). The coordinator will sequentially distribute these tasks using RDMA, whereby the serialized contents of the task are read from disk into the coordinator’s user space memory are directly transfered over to an equally sized buffer in the worker’s user space memory before continuing on to the next worker.

Open source BSD licensed Accelio library [11] is used to coordinate the transfer of user space memory between machines as shown in Fig. (1). Following the nomenclature of the library, a client will allocate and register a block of user space memory; registration ensures that the operating system will not page out the memory while the network interface card is reading its contents. Next, the client will send a request asking the server to allocate an equally sized buffer. When the server receives this request, it will attempt to allocate and register the requested buffer. Once done, it will issue the rkey of the memory back to the client. An rkey is a unique identifier representing the location of the memory on the remote machine. Upon receipt of this information, the client will then ask the library to issue a RDMA write given the client’s and server’s rkeys. Since RDMA will bypass the server’s CPU all together, the server will not know when the operation completes; however, when the RDMA write completes, the client is notified and it can then notify the server that the operation is complete by terminating the connection and session.

During development it was discovered that the amount of memory that can be transfered this way is limited to just 64 MiB before the library errs out (xio_connection send queue overflow). To work around this limitation for larger partitions, the client will send chunks of memory in 64 MiB blocks. The same procedure detailed above is followed, however it is done for each block of memory with rkeys being exchanged for the appropriate offset on each client RDMA write complete notification until the entire contents of memory have been transfered. On each side the appropriate unregistering and deallocation of ancillary memory takes place and the worker deserializes the memory into a task before proceeding on to the first iteration of k-Means algorithm.

As an alternative to using direct RDMA writes, a design based on Accelio messaging was considered. In this design the client allocates memory for the serialized task and issues an allocation request to the server. The server services the request and the contents of memory are transfered in 8 KiB blocks before the exchange of messages is no longer necessary. While this approach requires fewer Accelio API calls to coordinate, it is significantly slower than the more involved direct RDMA based approach.

#### Iteration and aggregation

The k-Means algorithm is implemented as both a sequential and parallel multi-core kernel that can be selected at compile time. The sequential version is identical to what was discussed in the background, whereas the parallel kernel is a simplified version of the distributed system less data transfer and fault-tolerance enhancements. Data is partitioned onto threads equaling the total number of cores on the system. Each iteration threads are created to run the sequential kernel on each thread’s partition of the data. Once all the threads complete, they are joined back to the main thread where their partial results are aggregated by the main thread and used for the distributed synchronization.

#### Synchronization

In this distributed setting, the barrier must be able to accommodate the fluctuating presence of workers due to failures. Multiple designs were considered, but the all-to-all paradigm was chosen for its redundancy, local view of synchronization, and allows the coordinator to fail. The scheme may not scale well and future work would need to investigate alternative methods. Unlike the data transfer section, plain TCP sockets are used since the quantities of data being shared are significantly smaller.

A few assumptions need to be stated before explaining the protocol. First, the coordinator is allowed to be a single point of failure for recovery and failures for all participants are assumed to be network oriented up to network partitioning. Network partitions can operate independently, but cannot be reunified into a single partition. All other hardware resources are assumed to be reliable. Finally, the partition associated with a lost worker is not reassigned to another worker. It is simply dropped from the computations under the assumption that losing that partition will not significantly deteriorate the quality of the results up to some tolerance as mentioned in the introduction.

The first step to supporting transient workers is to support process restart. When a worker receives a task from the coordinator, it will serialize a snapshot to disk, service the task, and then remove the snapshot. When a worker process is started, it will deserialize an existing snapshot, and verify that the task is ongoing with the coordinator before joining, or discard the stale snapshot before awaiting the next task from the coordinator.

Figure 2: Example in which a worker reintegrates with the group after being offline for several iterations. Red lines denote blocking. (Timeout queries not shown.)

The distributed barrier next maintains a list of active, inactive and recovered participants in the system. The first phase of synchronization is the notification phase in which each participant will issue identifying information, current iteration, and its partial results for the k-Means algorithm to all other active participants. Those participants that cannot be reached are moved from the active to inactive list.

The second phase is the waiting phase in which a listening thread accumulates notifications on two separate blocking unbounded queues in the background. The recovery queue is for notifications, and the results queue for sharing partial results. For each active participant the results queue will be dequeued, and for each inactive participant, the results queue will be peeked and dequeued if nonempty. Because the results queue is blocking, a timeout is used to allow a participant to verify that another participant hasn’t gone offline in the time it took the participant to notify the other and wait for its response. If the other participant has gone offline, then it is moved from the active to inactive list, otherwise the participant will continue to wait on the other.

Each partial result from the results queue will be placed into a solicited or unsolicited list based on if its origin is a participant that was previously notified. The coordinator will then locally examine the unsolicited list and place those zero iteration requests in the recovered list when it is in a nonzero iteration. Workers will examine the unsolicited list and discard any requests whose iteration does not match their own.

The recovery phase begins by an inactive worker coming back online and sending their results to the coordinator, and then waiting to receive the coordinators results and a current iteration notification. The next iteration, the coordinator will look at its recovered list and send the current iteration to the recovering worker, then wait until it receives a resynchronized notification. Upon receiving the current iteration notification, the recovering worker will then go and notify all the other workers in the cluster of its results, and wait for them to response before issuing a resynchronized notification to the coordinator. At which point the recovering worker is fully synchronized with the rest of the system. Once the coordinator receives this notification on its recovery queue, it will move the recovering worker off the inactive and recovery lists and on to the active list before notifying the other workers of its results.

Once the notification and waiting phase have completed, all participants are synchronized and may independently find the next set of centroids by aggregating their partial results from both solicited and unsolicited lists, and then begin the next iteration of the k-Means algorithm. This process will continue until convergence, or the maximum number of iterations has been reached.

### Experiments

Experiments were conducted on four virtual machines running Red Hat 4.8 with 8 GiB of RAM and single tri-core Intel Xeon E312xx class 2.6 GHz processor. Host machines are equipped with Mellanox Connect X-3 Pro EN network interface cards supporting 40 Gbs RoCE. Reported times are wall time as measured by gettimeofday. All C++98 source code compiled using g++ without any optimization flags.

Spark runtime comparison ran on Amazon Web Services Elastic Map Reduce with four m1.large instances running dual core Xeon E5 2.0 GHz processor with 7.5 GiB of RAM supporting 10 Gbps Ethernet. Spark 1.5.2 comparison based on MLlib KMeans.train and reported time measured by System.currentTimeMillis. All Java 1.7 code compiled by javac and packaged by Maven.

#### Transfer Rates

 Figure 3: Transfer rates for variable block sizes up to 64 MiB and fixed 128 MiB payload. Figure 4: Transfer rates for fixed 64 MiB RDMA and 8 KiB message based block sizes and variable payloads up to 7 GiB.

Fig. (3) shows the influence of block size on transfer rate of a fixed payload for RDMA based transfers. As mentioned in the design section, Accelio only supports block sizes up to 64 MiB for RDMA transfers. When the block size is 8 KiB, it takes 117x longer to transfer the same payload than when a block size of 64 MiB is used. This performance is attributed to having to exchange fewer messages for the same volume of data. For the payload of 128 MiB considered, a peak transfer rate of 4.7 Gbps was obtained.

Fig. (4) looks at the relationship of a fixed 64 MiB block size for RDMA transfers up to 7 GiB before exhausting available system resources. A peak transfer rate of 7.7 Gbps is observed which is still significantly less than the 40 Gbps capacity of the network. This would suggest a few possibilities: the Mellanox X-3 hardware was configured incorrectly, there may be network switches limiting the transfer of data, or that there is still room for improvement in the Accelio library.

It is unclear why there should be a kink in performance at 2 GiB. Possible explanations considered the impacts of hardware virtualization, influence of memory distribution on the physical DRAM, and potential Accelio issues. Further experiments are needed to pinpoint the root cause.

To demonstrate the advanced performance of Accelio RDMA based transfers over Accelio messaging based transfers, Fig. (4) includes transfer performance based on 8 KiB messaging. For the values considered, the message based approach was 5.5x slower than the RDMA based approach. These results are a consequence of the number of messages being exchanged for the smaller block size and the overhead of using Accelio.

#### Recovery time

Figure 5: Example of Worker A going offline at iteration 29 and coming back online at iteration 69.

Fig. (5) demonstrates a four node system in which one worker departs and returns to the system forty iterations later. Of note is the seamless departure and reintegration into the system without inducing increased synchronization times for the other participants. For the four node system, average synchronization time was 16 ms. For recovering workers, average reintegrate time was 225 ms with high variance.

#### Runtime performance

Total Percentage
Sharing 721.9 6.7
Computing 8558.7 79.1
Synchronizing 1542.5 14.2
Unaccounted 6.2 0.1
Total 10820.4 100
Table 1: Time in milliseconds spent in each task for 100 iterations, $d = 2, K = 4, N = 10,000,000$.

Looking at the distribution of work, roughly 80% of the work is going towards actual computation that we care about and the remaining 20% what amounts to distributed system bookkeeping. Of that 20% the largest chunk is synchronization suggesting that a more efficient implementation is needed and that it may be worth abandoning sockets in favor of low latency, RDMA based transfers.

 Figure 6: Runtime for varying input based Accelio messaging and sequential kernel, and RDMA and parallel kernel. The latter typically being 2.5x faster Figure 7: Sequential, parallel, distributed versions of k-Means for varying input sizes with Spark runtime for reference.

Runtime of the system for varying inputs is shown in Fig. (6) based on its original Accelio messaging with sequential calculations, and final RDMA transfers with parallel calculations. The latter cuts runtime by 2.5x which isn’t terrible since each machine only has three cores.

The general runtime for different configurations for the final system is shown in Fig. (7). The sequential algorithm is well suited to process inputs less than ten thousand, parallel less than one million, and the distributed system for anything larger. In general, the system performed 40-50x faster than Spark on varying input sizes for a 9 core vs 8 core configuration. These are conservative estimates since the system does 100 fixed iterations, whereas and Spark’s MLlib $\text{k-Means} \lvert \lvert$ implementation stops as soon as possible (typically 5-10 iterations).

Figure 8: Speedup of the system relative to the sequential algorithm.

The overall speedup of the system relative to the sequential algorithm is shown in Fig. (8). For a 12 core system we observe at most a 7.3x runtime improvement (consistent with the performance and sequential vs. parallel breakdowns). In general, in the time it takes the distributed system to process 100 million entries, the sequential algorithm would only be able to process 13.7 million entries. Further optimization is desired.

### Discussion

#### Related Work

The established trend in distributed machine learning systems is to use general purpose platforms such as Google’s MapReduce and Apache’s Spark and Mahout. Low latency, distributed shared memory systems such as FaRM and Grappa are gaining traction as ways to provide better performance thanks to declining memory prices and greater RDMA adoption. The next phase of this procession is exemplified by GPUNet [7], representing systems built around GPUDirect RDMA, and will likely become the leading choice for enterprise customers seeking performance given the rise of deep learning applications on the GPU.

The design of this system is influenced by these trends with the overall parallelization of the k-Means algorithm resembling the map-reduce paradigm and RDMA transfers used here reflecting the trend of using HPC scale technologies in the data center. Given that this system was written in a lower level-language and specialized for the task, it isn’t surprising that it delivered better performance (40-50x) than the leading general purpose system (Apache Spark) written in higher level language.

That said, given the prominence of general purpose distributed machine learning systems, specialized systems for k-Means clustering are uncommon and typically designed for unique environments (e.g., wireless sensor networks [13]). In the cases where k-means is investigated, the focus is on approximation techniques that are statistically bounded [6], and efficient communication patterns [2]; both of these considerations could further be incorporated into this work to realize better performance.

For the barrier, most of the literature focuses on static lists of participants in shared [12] and distributed [4] multi-processor systems with an emphasis on designing solutions for specific networks (e.g., n-dimensional meshes, wormhole-routed networks, etc). For barriers that accommodate transient participants, the focus is on formal verification with [8] and [1] focused on shared and distributed systems respectively. No performance benchmarks could be found for a direct comparison to this work, however, [5] presents a RDMA over InfiniBand based dissemination barrier that is significantly faster ($\mu s$ vs $ms$) suggesting opportunities for future RDMA use.

#### Future Work

##### RDMA

Accelio provides a convenient API for orchestrating RDMA reads and writes at the expense of performance. Accelio serves a niche market and its maturity reflects that reality. There are opportunities to make Accelio more robust and capable for transferring large chunks of data. If this avenue of development is not fruitful, alternative libraries such as IBVerbs may be used in an attempt to obtain advertised transfer rates for RDMA enabled hardware.

As implemented, the distribution of tasks over RDMA is linear. This was done to take advantage of sequential read speeds from disk, but it may not scale well for larger clusters. Additional work could be done to look at pull based architectures where participants perform RDMA reads to obtain their tasks rather than the existing push based architecture built around RDMA writes. As well as exploring these paradigms for distributed file systems to accommodate larger datasets.

Calculations presented were multi-core oriented, but k-Means clustering can be done on the GPU as well. Technologies such as NVIDIA’s GPUDirect RDMA and AMD’s DirectGMA present an opportunity to accelerate calculations by minimizing expensive memory transfers between host and device in favor of between devices alone. Provided adequate hardware resources, this could deliver several orders magnitude faster runtime performance.

##### Kernel

As demonstrated in the experiments section, overall runtime is heavily dominated by the actual k-Means iteration suggesting refinements in the implementation will lead to appreciable performance gains. To this effect, additional work could be put into the sequential kernel to make better use of SIMD features of the underlying processor. Alternatively, the OpenBLAS library could be used to streamline the many linear algebra calculations since they already provides highly optimized SIMD features. Accelerators like GPUs, FPGAs, and MICs could be investigated to serve as alternative kernels for future iterations of the system.

##### Barrier

The barrier is by no means perfect and it leaves much to be desired. Beginning with allowing recovery of a failed coordinator, allowing reunion of network partitions, dynamic worker onboarding, and suspending calculations when too much of the system goes offline to ensure quality results. Once these enhancements are incorporated in to the system, work can be done to integrate the underlying protocols away from a socket based paradigm to that of RDMA. In addition, formal verification of the protocol would guide its use into production and on to larger clusters. The system works reasonably well on a small cluster, but further work is needed to harden it for modern enterprise clusters.

##### Overall System

As alluded to in the background, shuffling of data could be added to the system so that end users do not have to do so beforehand. Similarly, more sophisticated scheduling routines could be investigated to ensure an even distribution of work on a system of machines with varying capabilities.

While the k-Means algorithm served as a piloting example, work could be done to further specialize the system to accommodate a class of unsupervised clustering algorithms that fit the general map-reduce paradigm. The end goal is to provide a plug-n-play system with a robust assortment of optimized routines that do not require expensive engineers to setup, exploit, and maintain as is the case for most existing platforms.

Alternatively, the future of the system could follow the evolution of other systems favoring a generalized framework that enable engineers to quickly distribute arbitrary algorithms. To set this system apart from others, the emphasis would be on providing an accelerator agnostic environment where user specified programs run on whatever accelerators (FPGA, GPU, MIC, multi-core, etc.) are present without having to write code specifically for those accelerators. Thus saving time and resources for enterprise customers. Examples of this paradigm are given by such libraries as CUDAfy.NET and Aparapi for translating C# and Java code to run on arbitrary GPUs.

#### Conclusion

This work described a cross-platform, fault-tolerant distributed system that leverages the open source library Accelio to move large volumes of data via RDMA. Transfer rates up to 7.7 Gbps out of the desired 40 Gbps were observed and it is assumed if flaws in the library were improved, that faster rates could be achieved. A synchronization protocol was discussed that supports transient workers in the coordinated calculation of k-Means centroids. The protocol works well for small clusters, offering 225 ms reintegration times without significantly affecting other participants in the system. Further work is warranted to harden the protocol for production use. Given the hardware resources available the distributed system was 7.3x out of the desired 12x faster than the sequential alternative. Compared to equivalent data loads, the system demonstrated 40-50x better runtime performance than Spark MLlib’s implementation of the algorithm suggesting the system is a competitive alternative for k-Means clustering.

### References

[1] Shivali Agarwal, Saurabh Joshi, and Rudrapatna K Shyamasundar. Distributed generalized dynamic barrier synchronization. In Distributed Computing and Networking, pages 143-154. Springer, 2011.

[2] Maria-Florina F Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, pages 1995-2003, 2013.

[3] Mike Estlick, Miriam Leeser, James Theiler, and John J Szymanski. Algorithmic transformations in the implementation of k-means clustering on recongurable hardware. In Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, pages 103-110. ACM, 2001.

[4] Torsten Hoeer, Torsten Mehlan, Frank Mietke, and Wolfgang Rehm. A survey of barrier algorithms for coarse grained supercomputers. 2004.

[5] Torsten Hoeer, Torsten Mehlan, Frank Mietke, and Wolfgang Rehm. Fast barrier synchronization for inniband/spl trade. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pages 7-pp. IEEE, 2006.

[6] Ruoming Jin, Anjan Goswami, and Gagan Agrawal. Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems, 10(1):17-40, 2006.

[7] Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Amir Wated, Emmett Witchel, and Mark Silberstein. Gpunet: Networking abstractions for gpu programs. In Proceedings of the International Conference on Operating Systems Design and Implementation, pages 6-8, 2014.

[8] Duy-Khanh Le, Wei-Ngan Chin, and Yong-Meng Teo. Verication of static and dynamic barrier synchronization using bounded permissions. In Formal Methods and Software Engineering, pages 231-248. Springer, 2013.

[9] Xiaobo Li and Zhixi Fang. Parallel clustering algorithms. Parallel Computing, 11(3):275-290, 1989.

[10] Stuart P Lloyd. Least squares quantization in pcm. Information Theory, IEEE Transactions on, 28(2):129-137, 1982.

[11] Mellanox Technologies Ltd. Accelio – the open source i/o, message, and rpc acceleration library.

[12] John M Mellor-Crummey and Michael L Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(1):21-65, 1991.

[13] Gabriele Oliva, Roberto Setola, and Christoforos N Hadjicostis. Distributed k-means algorithm. arXiv, 2013.

[14] Renato Recio, Bernard Metzler, Paul Culley, Je Hilland, and Dave Garcia. A remote direct memory access protocol specication. Technical report, 2007.

[15] Mario Zechner and Michael Granitzer. Accelerating k-means on the graphics processor via cuda. In Intensive Applications and Services, 2009. INTENSIVE’09. First International Conference on, pages 7-15. IEEE, 2009.

Written by lewellen

2015-12-22 at 1:43 pm

## Android ecosystem on Windows 7

with one comment

### Introduction

This past spring I decided to take a dive into mobile platforms and decided I’d get my feet with a little Android development. I’d done some Java development in the past and and with reports of increased market share, especially among tablets, I figured Android was the right platform to get started with. Since it’s always good to document things, here’s a rundown of what was needed to get a basic development environment working on Windows 7. As I continue to explore and learn more, I’ll continue to update this list with more information.

### JDK SE 1.6

First thing that I needed to download was the Java Development Kit (JDK) from Oracle. The JDK provides all of the necessary components to get started and run Java based applications.

### Android SDK

Next core development kit to download is the Android SDK from Google. The SDK has all of the Android specific tools and libraries to develop and test applications that are meant to run on the Android platform.

#### AVD Manager

The second thing I setup were two virtual devices using the AVD Manager. My main interest is in developing applications for tablets, so I created a Honeycomb virtual device with a gig of memory and a reduced screen size. Also decided to create a Gingerbread virtual device with half a gig of memory without any screen size restrictions. Both come in handy to make sure that any app I write will work well on handhelds and tablets.

### IntelliJ Community Edition

To do my development I decided to go with IntelliJ as my editor. A lot of people out there use Eclipse; I used to use it a long time ago, but decided that I wanted to try something new. Installation went smoothly and the only custom configuration dealt with specifying the location of the JDK and the Android SDK.

### Acer A500 USB Driver

Developing an application against a virtual device is fun and all, but nothing beats testing and using an application on an actual device. I decided to go with an Acer A500 since it was the right mix of features, cost and responsiveness. To deploy an application to the device, I did have to go and download a USB driver from Acer’s website so that the Android Debug Bridge (ADB) from the Android SDK would recognize the device when it was plugged in to my machine. Past that, it was a seamless experience of getting IntelliJ to deploy an application to the device and for me to begin hands-on testing.

### Wrap-up

Overall, getting started with the Android Platform has gone very smoothly. I dove into all of this without reading any documentation and the process was fairly self explanatory. I’m looking forward to learning more about the platform and ultimately trying to get a product out on the App Market.

Written by lewellen

2012-03-01 at 11:09 pm

## Thoughts on delivering software

### Preface

A lot can happen in four years. We get used to its passing as a milestone marking a transition from one stage of life to another. I was thinking about the last four some odd years since I’d graduated from college and came to realize that while I’ve learned a lot since then, I still have a ways to go. One recurring question I have floating around my head all this time is how to deliver a complete product. At the end of college I had the theoretical side of the story, but I only had limited exposure doing internships and consulting.

My first job after college was working for a small company focusing on healthcare. Our development group was just a couple months old and I was the second developer brought on board. We started out as a group of cowboys doing what was necessary to retain customers and bring on new ones. Despite the chaos, lack of requirements and ever shifting priorities, we managed to make the company successful and a leader in its industry. During that time, I feel that I have learned a lot about how to deliver software in the face of uncertainty.

More importantly, I’ve learned how be more than just a developer and grow into someone trusted to autonomously improve the business with my vision and execution. I’m going to cover a few choice topics that I feel have had the biggest impact on how I deliver software. Things that I feel answer that burning question of how to deliver a complete product. This isn’t intended to be comprehensive, but it should give someone where I was four years ago a little insight into how to get a quality product out the door even when it feels like it’ll never happen.

### Communication

When you are sitting in front of a computer, it’s easy to get excited and starting writing whatever happens to pop into mind and race its way down your fingers. Few things are more satisfying as a developer than falling into flow and building something to completion- going through the motions of refactoring, testing and becoming convinced that you’ve written something solid. You eagerly present your product to the stakeholders. They look back at you and say, “this isn’t at all what we asked for”.

It is easy to make this mistake. Not taking the time to fully understand the stakeholder’s needs and more importantly, making sure they do too. After all, the biggest reason why software fails to meet expectations is because expectations weren’t established and requirements weren’t well defined. The source of these failures lies in failures in communication. It should seem obvious, but any software endeavor that involves more than one person relies on thorough communication to execute a shared goal- to bring a solution to market in a timely manner.

To make sure that you avoid this pitfall, take the time to work with your stakeholders. An hour of talking and planning can prevent you from wasting a day of misdirected development time. Focus on the domain problem and its solution. How the solution is implemented, is largely irrelevant compared to its completeness of requirements. To ensure that you and your stakeholder have the same view of the product, produce a dossier similar to a user’s guide that outlines how the solution behaves. Once defined, produce a quick prototype and present it to the stakeholder. Upon their approval, implement the product.

### Simplicity

Now, simplicity doesn’t imply mediocre no more than complexity implies extraordinary. As Antoine de Saint-Exupéry said in Wind, Sand and Stars, “Il semble que la perfection soit atteinte non quand il n’y a plus rien à ajouter, mais quand il n’y a plus rien à retrancher“. Simplicity is simply the path to perfection and removing that which is unnecessary. Much of what can be said about delivering software can be said about managing complexity. Markets change and so too do the demands on products. As a consequence, so too do the demands on code. How much that code has to change depends largely on how well it was written.

Complexity encroaches on software in one of two ways. A product that grows organically without any direction and second through poor coding practices and design. This lack of direction comes from stakeholders committing to one-off functionality without thinking how it contributes to a uniform product. In the second example, an unorthodox solution results in a product that can never be changed and a bad set of abstractions make it difficult to change anything.

What can be done to avoid complexity and enable simplicity? Focus on established principles of the field. Design components that are modular, capable only of doing one thing and doing it well. Interfaces between systems should be flexible in what they accept and resistant to change. Code should assert its assumptions and not be ad hoc in its solutions. Above all, code should be written with meaningful abstractions that represent the problem and take into consideration how that problem can mutate or be extended to other problems. Doing this intelligently, results in maintainable, reliable and predictable code -but more importantly- code that ships.

### Introspection

Being a software professional requires a certain degree of introspection. Solving problems requires us to build complex models in our minds and walk around them to discover insights that lead to solutions. The ability to maintain these mental models is an important factor to our ability to deliver a product. Despite the amount of time spent in one’s mind, there is often a failure to think about how one carries out his or her job. Lucid requirements, minimal designs and all the theory in the world are useless if you don’t know how to keep up with the demands placed upon you.

What happens when you fail to keep up? You might think to spend an extra hour a day trying to catch up. This works for a bit, but not for long. You end up giving something up for that extra hour and usually it means sacrificing sleep, relationships, interests or your health. Doing this for prolonged periods leads to falling further and further behind until deadlines are racing past you, pounds piling up on your body and work invading every aspect of your life. Ultimately, it all leads to burnout.

To keep up with demands, and to get that product out the door, it is important to know yourself and set limitations. Figure out if you are most effective in the morning or night. Understand if you like switching between projects or like focusing on one project at a time. At the end of each day, write down what you’ve accomplished and what you hope to accomplish the next day. A list makes it easy for you to see that things are getting done and what you need to be working on. If you find yourself behind due to reasons beyond your control, negotiate with the stakeholders to extend the timeline. Nothing is ever set in stone.

### Epilogue

I doubt there is an all encompassing answer to how to deliver a complete product. Each year I have a different take on the question and no doubt I will continue to in the years to come. I am certain however, that building a product takes more than just a few clever software professionals, it takes a broad set of skills and abilities from people of different backgrounds. Sales, management, marketing, accounting, information technology and many other disciplines contribute to the delivery of a complete product. Building a product with these groups in mind makes it easier to deliver a complete product but also one that is successful.

Do each of these groups have the tools they need to provide a quality service to customers? Is there a well defined process for supporting the product and escalating issues to development? Does sales and marketing have access to usage data to know what customers are doing with the product? Does accounting and payroll have the information they need to send out bills and payments for services rendered? A product can really only be called complete if there is a set of processes, services and tools supporting it.

Take the time to really learn your company and industry. It’s easy enough to just be someone who spends all day writing code, but its just as easy to spend time learning what the business is really all about and what problems it aims to solve in its industry. Spend the time to learn what other groups in the company are up to and how their efforts are contributing to the success of your product. The technical mind is a great platform for spotting opportunities that can result in better products and ultimately a better company. When you have the big picture, there really isn’t anything holding you back from delivering a complete product.

Written by lewellen

2012-01-01 at 8:00 am

## Haskell ecosystem on Windows XP

It’s been fun watching the Haskell ecosystem evolve into a mature system over the years. Seems that about every three months it takes a leap forward and I usually find myself uninstalling what I previously had and putting the latest and greatest on my laptop. To save myself some time in the future, I’ve compiled this post as a reference of basic “stuff” that is good to have for doing Haskell development on a Windows XP machine.

It used to be that you had to go find a number of different applications to get packages, compile source code and generate documents (among a handful of other things), then a group of folks identified an opportunity and put together a platform for getting new users a place to start. The result is the Haskell Platform.

After installing, you’ll want to go to the command line and run the following commands to make sure that you’ve got the latest version of Cabal and to make sure that it has the latest package list:

C:\cabal install cabal-install
C:\cabal update


### Leksah

Many developers are probably used to having a quality Integrated Development Environment (IDE) to work with and the Haskell Community’s answer is Leksah. Leksah is still fairly green and has a ways to go before being comparable to Eclipse or Visual Studio, but nonetheless, Leksah is packed with plenty of goodies that will make for easy development of packages and modules for distribution on Cabal.

It is best to use the installer from the Leksah website. Once you’ve installed the latest, you’ll need to run the following from the command-line

C:\ghc-pkg recache


So that the packages on the system (those provided by GHC) will show up when you have the browse pane open.

### Gtk2hs

If you plan on doing any Graphical User Interfaces (GUIs), then you’ll want to get the Haskell binding to the GTK+ library. On the page there should be an “All-in-one bundle”- for the purpose of the following, I went with version 2.20.

After extracting the bundle on the machine, make sure that the path you extracted the bundle at along with the child bin directory is added to the PATH environment variable.

Then from the command-line run the following and you should be able to include GTK in your project:

C:\cabal install gtk


### HOpenGL

I’ve been working on some basic game programming and I’ve done some stuff in the past with OpenGL so I decided to give the Haskell bindings a try. Windows doesn’t natively ship with the OpenGL library, so you’ll need to get it from here.

Then get the following packages off of Cabal:

c:\cabal install glut
C:\cabal install opengl


### Wrap-up

I haven’t done a dry run to test all of the above, so if you follow all of the above and come across a problem, post the solution in the comments. I’ll continue to update this post as I identify any problems or come across additional “stuff” that falls into the must-have category.

Written by lewellen

2010-08-01 at 8:00 am

## Getting a Development Network Started

I’ve been in the process of revamping my home network and have been spending time thinking about how I’d like to set up my development environment for my personal projects that I might try and sell one day. Most of the work I do is C#, ASP.NET, PHP, Java, Haskell,… the list goes on, so I’ve been thinking about what kind of solution will allow me to build against different OSes and platforms. The following is a rundown of this thought process and the considerations and decisions made in bringing up my network.

Anytime I do any planning there are a handful of main points that I try to keep focused on:

• Cost – How much money I’m interested in putting into a project?
• Time – Total time investment to bring up the hardware and software.
• Quality – Am I after a quick solution or one that will have long lasting use?
• Portability – How easy would it be to move the system to another platform.
• Extensibility – How easy it is to add on new entities.

I know I want to keep the project under 1000 USD- this cost includes hardware, operating system licenses, software licenses, utility costs over the lifetime of the solution and opportunity costs etc. Time wise, I want something that could take an hour a day to get setup, working and tweaked to perfection over the course of a week. I want something that is going to be flexible enough to be useful five years down the road, but is also capable of doing what I want today, thus I want a solution that doesn’t look like it was thrown together with duct tape but also doesn’t look like I spent years planning it out. It is important for me to be able to port my solution to new hardware quickly and effortlessly as well as add on new elements as needed. This is especially important if my environment crashes as a result of hardware failure or software malice.

There are a variety of ecosystems that I’m used to working with. the following table summarizes the residents of each:

Ecosystem Type of Work Runtimes IDE Databases Web Server
.net Websites, Web Services, Clients, Services .net Framework 4.0, Mono 2.6 Visual Studio 2008 Express SQL Server 2008 Express Internet Information Services 7.0
Java Clients JVE 1.6 Eclipse Galileo MySQL Community Server 5.1 Apache 2.2
Haskell Clients NA vim, yi, Leksah MySQL Community Server 5.1 Apache 2.2

These ecosystems also have corresponding environments for dealing with source control, build automation, bug tracking and project tracking. Given the ecosystems I’m interested in, I’ve decided on Subversion, Hudson, Mantis and twiki to manage all of my projects’ artifacts.

Having reviewed what a lot of other shops have done, there are a couple common elements that most development networks incorporate:

• repo– Source repository and OS specific Database(s).
• dev– IDEs and frameworks for the development of OS specific applications. (per developer machine, often dual booting)
• web– OS specific web servers hosting platform specific websites.
• build– Dedicated build machines for producing assemblies for specific platforms and OSes.

Given these elements, the languages, platforms and operating systems that I’m interested in, I’ve settled on an ideal network that looks like the following:

Important to notice the use of virtualization here. Being able to store a series of ISOs for each of the element groups on a NAS makes it easy to bring up new instances and make backups, thus satisfying my time and portability criteria. As well as satisfying my cost criteria as it is cheaper to purchase a beefy box running several virtual machines than it is to purchase several physical machines. At the time of writing (2009-12), most quad core machines run for about 1000 USD and, and most (consumer) NASes cost about 100-300 USD. Now, I could load up the server up with 1TB storage for an additional 100 USD. Of course, this means I have a single point of failure- which in a home environment may not be a huge deal. In terms of money these two are the main consumers at the hardware level. Everything else already exists on the network.

Lets take a look at the estimated costs:

Item Quantity Amount (USD) Extended Amount (USD)
Dell Studio XPS 8000 1 1100.00 1100.00
1TB Western Digital MyBook 1 120.00 120.00
Windows XP Professional Licenses 4 100.00 400.00
1620.00

That total amount is a little more that initially desired. It is possible to collapse vm-windows-dev and time-thief down to one machine, and then collapse vm-windows-repo, vm-windows-web and vm-windows-build down to a single machine resulting in a cost savings of 300.00 USD from reduced operating system license costs. If the NAS is removed from the picture, that brings us down to 1200.00 which is about as close as I’m going to get to my initial target of 1000 USD. Not sure if this is the final setup that I will end up going with, per usual, I’ll update this post with any new developments.

Written by lewellen

2010-02-01 at 8:00 am

## Solutions to some Microsoft interview questions

with one comment

A couple of weeks ago, someone on reddit posted a link to a collection of Microsoft Interview Questions. As someone who interviewed with them while in college, I was curious to see what kind question they were asking folks. After reviewing the list, I thought I’d work out a few that looked interesting:

Imagine an analog clock set to 12 o’clock. Note that the hour and minute hands overlap. How many times each day do both the hour and minute hands overlap? How would you determine the exact times of the day that this occurs?

$\displaystyle \text{Let } \theta_{h}(t) = \frac{2\pi}{43200}t , \theta_{m}(t) = \frac{2\pi}{3600} t \in \mathbb{R} \to \mathbb{R}, t \in [0, 86400)$

$\displaystyle \lvert \theta_{h}(t) - \theta_{m}(t) \rvert \equiv 0 \pmod{2\pi} \Rightarrow \frac{39600}{155520000} 2\pi t \equiv 0 \pmod{2\pi}$

$\displaystyle \Rightarrow t_{k} = \frac{43200}{11} k, k \in [0, 22)$

Thus*: 12:00:00 AM, 01:05:27 AM, 02:10:54 AM, 03:16:21 AM, 04:21:49 AM, 05:27:16 AM, 06:32:43 AM, 07:38:10 AM, 08:43:38 AM, 09:49:05 AM, 10:54:32 AM, 12:00:00 PM, 01:05:27 PM, 02:10:54 PM, 03:16:21 PM, 04:21:49 PM, 05:27:16 PM, 06:32:43 PM, 07:38:10 PM, 08:43:38 PM, 09:49:05 PM and 10:54:32 PM.

* Floor the conversion from $t$ to hour, minute and second of day.

Pairs of primes separated by a single number are called prime pairs. Examples are 17 and 19. Prove that the number between a prime pair is always divisible by 6 (assuming both numbers in the pair are greater than 6). Now prove that there are no ‘prime triples.’

Assuming:
$\text{Let } n, r \in \mathbb{N}$
$p = 6n \pm r , r \in \{ 0, 2 \} \Rightarrow 3|p$
$p = 6n \pm r , r \in \{ 3, 4 \} \Rightarrow 2|p$
$p = 6n \pm r , r \in \{1, 5 \} \Rightarrow 1|p \wedge p|p$
$\therefore (\forall p > 3) \in \mathbb{P}, p = 6n \pm r, r \in \{1, 5\}$

Twin Primes:
$\displaystyle \text{Let } p = 6n - 1, q = 6n + 1 \in \mathbb{P}$
$\displaystyle n, r = \frac{p+q}{2} \in \mathbb{N}$
$\displaystyle\Rightarrow r = \frac{6n - 1 + 6n + 1}{2}$
$\Rightarrow r = 6n$
$\therefore 6|r$

Prime Triples:
$\displaystyle \text{Let } u = 6n - 1, v = 6n + 1, v = 6m - 1, w = 6m + 1 \in \mathbb{P}$
$\displaystyle n < m \in \mathbb{N}$
$\displaystyle 6n + 1 = 6m - 1 \Rightarrow m - n = \frac{1}{3}$
$\displaystyle \frac{1}{3} \notin \mathbb{N}$
$\displaystyle \therefore \not \exists u, v, w \in \mathbb{P}$

There are 4 women who want to cross a bridge. They all begin on the same side. You have 17 minutes to get all of them across to the other side. It is night. There is one flashlight. A maximum of two people can cross at one time. Any party who crosses, either 1 or 2 people, must have the flashlight with them. The flashlight must be walked back and forth, it cannot be thrown, etc. Each woman walks at a different speed. A pair must walk together at the rate of the slower woman’s pace.

For example if Woman 1 and Woman 4 walk across first, 10 minutes have elapsed when they get to the other side of the bridge. If Woman 4 then returns with the flashlight, a total of 20 minutes have passed and you have failed the mission. What is the order required to get all women across in 17 minutes? Now, what’s the other way?

One way:

1. Woman 1 and 2 Cross, woman 1 returns. 3 minutes total.
2. Woman 3 and 4 Cross, woman 2 returns. 15 minutes total.
3. Woman 1 and 2 Cross. 17 Minutes total.

Other way:

1. Woman 1 and 2 Cross, woman 2 returns. 4 minutes total.
2. Woman 3 and 4 Cross, woman 1 returns. 11 minutes total.
3. Woman 1 and 2 Cross. 17 minutes total.

If you had an infinite supply of water and a 5 quart and 3 quart pail, how would you measure exactly 4 quarts?

1. Fill 5 quart pail with 5 quarts of water from source.
2. Fill 3 quart pail with 3 quarts of water from 5 quart pail.
3. Empty 3 quart pail.
4. Empty 5 quart pail containing 2 quarts of water into 3 quart pail.
5. Fill 5 quart pail with 5 quarts of water from source.
6. Fill 3 quart pail with water from 5 quart pail.
7. 5 quart pail now contains 4 quarts of water.

Suppose you have an array of 1001 integers. The integers are in random order, but you know each of the integers is between 1 and 1000 (inclusive). In addition, each number appears only once in the array, except for one number, which occurs twice. Assume that you can access each element of the array only once. Describe an algorithm to find the repeated number. If you used auxiliary storage in your algorithm, can you find an algorithm that does not require it?

$f(A) = \displaystyle\sum_{k = 0}^{|A| - 1} A_{k} -\frac{|A|(|A|-1)}{2}$

public int duplicateNumber(int[] A) {
int count = 0;
for(int k = 0; k < A.Length; k++)
count += A[k];
return count - (A.Length * (A.Length - 1) >> 1);
}


Count the number of set bits in a number. Now optimize for speed. Now optimize for size.

$\displaystyle \text{Let } m = \sum_{k = 0}^{\infty} A_{k} 2^{k}, A_{k} \in [0,1], m \in \mathbb{N}_{0}$
$\displaystyle f(n) = \sum_{k=0}^{\infty}A_{k}$
$\displaystyle f(n) = \begin{cases} \left( n - n \oslash 2 \right) + f\left( n \oslash 2 \right) & n \neq 0 \\ 0 & \text{otherwise} \end{cases}$

public int bitsUsed(int n) {
int count = 0;
while(n != 0) {
count += n & 1;
n >>= 1;
}

return count;
}


Written by lewellen

2008-08-24 at 6:00 pm

## Automated inheritance refactoring

Over time, we all gain experience and insight in deciding how to generalize the principle actors of our problem domains into different layers of abstraction to minimize the amount of code we write. After awhile, this practice becomes tedious and one begins to wonder whether or not it is feasible to have an integrated refactoring (in our IDE of choice) that will review our class inheritance hierarchy and suggest to us an idealized class inheritance hierarchy. After all, an optimal solution would be able to identify relationships that our minds would otherwise not be able to fathom- as is the case as our code bases and development costs increase.

What we’re really interested in is an automated version of Extract Superclass that takes in a collection of classes and returns some number of superclasses. Thing is, we are back to our original problem where we have a collection of classes that we want to generalize. If we continue this process as a feedback loop we end up eventually with an irreducible collection of classes. The original collection along with the rest of the generated collections constitute the idealized class inheritance hierarchy.

During the mid-nineties this topic was the subject of a handful of masters and doctorate programs but since then, there seems to be a significant gap in the flow of academic material on the subject. Furthermore, there seems to be little recent evidence from industry to indicate work being done on a solution. Is such a solution simply not needed? Are the outputs of solutions impractical? What is the blocking element to adoption? The only evidence of this idea getting any momentum is given by Guru– a solution targeted at the Self programming language. Where are the solutions for C++, Java, C# et al.?

Written by lewellen

2008-07-20 at 8:00 pm

Posted in Software Engineering

Tagged with