Archive for December 2012
Parallel Merge Sort in Java
Introduction
This past November I was a pretty busy getting settled into a new job and trying to balance life’s other priorities. With a new job also came a new technology stack and while I’ll continue to do C# development in my free time, I’m going to be going back to doing Java development after a seven year hiatus. Before starting the new job, I decided to refresh my memory of the language’s finer details when it comes to generics and threading. So, I decided to implement something simple and settled on a parallel implementation of merge sort. This article is going to focus on making use of Java’s various features and evaluating the theoretical and empirical run time performance of the sequential and parallel versions of the algorithm.
Sequential Approach
Specification
Given a list of values, the list is sorted by employing a divide and conquer method that partitions the list into two (roughly) equal sized partitions, followed by recursively sorting each partition and then merging the two resulting sorted partitions into the final sorted list.
Pseudocode

Time Complexity
In terms of time complexity, the algorithm is on the order of . To show this, observe that the input size, , is divided into to two equal parts, , followed by a merge operation, . This leads to the recurrence relation given by . By induction, the recurrence relation is reduced to . Observing that the merge function is on the order , i.e., , then the expression reduces further to and . As the number of subdivisions increases, eventually will be reduced to . As such, let which implies which implies , and thus . Therefore,
Implementation
In attempting to implement a generic version of merge sort there were a few matters that needed to be addressed. First, the type being sorted required an order relation to be specified so that the merge operation could take place. This is facilitated by restricting the type parameter T
to Comparable<T>
. Secondly, I had forgotten that you can’t initialize arrays of generics in Java like you can in C# ^{[1]}. To workaround this limitation, I settled on specifying the desired operations over implementations of the List<T>
interface. Finally, since the List<T>
interface makes no guarantees that its implementations provide (near) constant time reading or writing of elements from the list, an additional generic parameter, L
, was added so that only those implementations of the List<T>
and RandomAccess
^{[2]} interfaces could use this implementation of merge sort. The rest of the implementation is a near facsimile of the pseudocode.
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public interface IListOperation <T, L extends List<T> & RandomAccess> { L execute(); }
package com.wordpress.antimatroid; import java.util.ArrayList; import java.util.List; import java.util.RandomAccess; public class CopyListOperation <T, L extends List<T> & RandomAccess> implements IListOperation<T, L> { private final L source; private final int length, initialIndex; public CopyListOperation(L source, int length, int initialIndex) { if(source == null) throw new IllegalArgumentException("source must be nonnull."); if(length < 0) throw new IllegalArgumentException(String.format( "length, %d, must be greater than or equal to zero.", length )); if(initialIndex < 0) throw new IllegalArgumentException(String.format( "initialIndex, %d, must be greater than or equal to zero.", initialIndex )); if(initialIndex + length > source.size()) throw new IllegalArgumentException(String.format( "initialIndex, %d, + length, %d, must be less than or equal to source.size(), %d.", initialIndex, length, source.size() )); this.source = source; this.length = length; this.initialIndex = initialIndex; } @Override public L execute() { L destination = (L) new ArrayList<T>(length); for(int i = 0; i < length; i++) destination.add(i, source.get(initialIndex + i)); return destination; } }
package com.wordpress.antimatroid; import java.util.ArrayList; import java.util.List; import java.util.RandomAccess; public class MergeListOperation <T extends Comparable<T>, L extends List<T> & RandomAccess> implements IListOperation<T, L> { private final L a, b; public MergeListOperation(L a, L b) { if(a == null) throw new IllegalArgumentException("a must not be null."); if(b == null) throw new IllegalArgumentException("b must not be null."); this.a = a; this.b = b; } @Override public L execute() { int length = a.size() + b.size(); L c = (L) new ArrayList<T>(length); int i = 0, j = 0; for(int k = 0; k < length; k++) { if(i < a.size() && j < b.size()) { if(a.get(i).compareTo(b.get(j)) <= 0) { c.add(k, a.get(i++)); } else { c.add(k, b.get(j++)); } } else if (i < a.size() && j >= b.size()) { c.add(k, a.get(i++)); } else if (i >= a.size() && j < b.size()) { c.add(k, b.get(j++)); } else { break; } } return c; } }
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public class MergeSortListOperation < T extends Comparable<T>, L extends List<T> & RandomAccess > implements IListOperation<T, L> { private final L a; public MergeSortListOperation(L a) { if(a == null) throw new IllegalArgumentException("a must not be null."); this.a = a; } @Override public L execute() { if(a.size() <= 1) return a; CopyListOperation<T, L> leftPartition = new CopyListOperation<T, L>(a, (a.size() / 2) + a.size() % 2, 0); CopyListOperation<T, L> rightPartition = new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) + a.size() % 2); MergeSortListOperation<T, L> leftSort = new MergeSortListOperation<T, L>(leftPartition.execute()); MergeSortListOperation<T, L> rightSort = new MergeSortListOperation<T, L>(rightPartition.execute()); MergeListOperation<T, L> merge = new MergeListOperation<T, L>(leftSort.execute(), rightSort.execute()); return merge.execute(); } }
Run Time Analysis
Noting that the theoretical time complexity is , inputs of the form will yield a curve. Taking the logarithm of which will give . Observing that as increases the linear term will dominate the expression. As a result, the curve should look near linear in logarithmic space with the exception of small values of . Which means that conducting a linear least squares regression of the empirical run times in logarithmic space will yield a satisfactory approximation to the theoretical time complexity.
To verify that the implementation follows the theoretical time complexity, increasing values of were used to generate lists containing random values. These lists were then sorted and the System.nanoTime()
before and after values were used to determine the elapsed time. These values were collected and a total of 50 identical trails were conducted on an Intel Core i72630QM CPU @ 2.00 GHz based machine with 6.00 GB RAM.
As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.
Parallel Approach
Specification
The parallel implementation operates under the premise that the divide portion of merge sort can be easily parallelized by sorting one partition on the present thread and sorting the other partition on a secondary thread. Once the secondary thread has completed, then the two threads join, and consequently, the two sorted lists are merged. To avoid copious thread creation, whenever the input size is less than a threshold, , the sequential version of the algorithm is used.
This process can be easily visualized below where each lefthand branch is the originating thread processing the first partition, each righthand branch is the secondary thread processing the second partition and the junction of those edges represents the consequent merge operation after the secondary thread as joined back in with the originating thread.
Time Complexity
The introduction of parallelism changes the original recurrence relation to the following:
Assuming, , and that there is no asymptotic difference in sorting the first and second partition, then the time complexity is on the order of . To see this, observe that the recurrence relation becomes under the presented assumtions. Following the same process of induction as in the sequential case, the recurrence relation reduces to and is simplified further under the assumption to . Observing that the sum is a finite geometric series leads to and under the same reduction argument as before to . Thus, the time complexity of the parallel merge sort specified is
Assuming , then the time complexity of the algorithm is still on the order . Thus, for various values of and , the time complexity is between .
Implementation
In terms of parallelizing the sequential implementation, an addition interface, IThreadedListOperation
was added to provide a BeginOperation
, EndOperation
asynchronous programming model found in the .net world. After looking around the Java world, I didn’t encounter a preferred idiom, so I went with what I knew.
As I mentioned in the sequential approach, the original data structures were going to be arrays which have a guarantee of providing thread safe reads, but not necessarily thread safe writes. To avoid the issue all together, I decided that the IListOperations
should always return a new List<T>
instance so that only one thread at a time would be reading or manipulating that memory. Since I knew my implementation would not be sharing IListOperations
between threads, I decided not to gold plate the implementation with synchronization constructs. If in the future such ability were required, I would go back and modify the code accordingly.
For the parallel implementation I took advantage of the fact that method arguments are evaluated lefttoright ^{[3]} to save one some space, but if the specification ever changed, then it would be more appropriate to move the out the leftSort.execute()
and rightSort.executeEnd()
methods up a line to form a more explicit operation.
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; abstract public class IThreadedListOperation <T, L extends List<T> & RandomAccess> implements Runnable, IListOperation<T, L> { private Thread thread; public void executeBegin() { if(thread != null) throw new IllegalStateException(); thread = new Thread(this); thread.start(); } public L executeEnd() { if(thread == null) throw new IllegalStateException(); try { thread.join(); } catch (InterruptedException e) { } return getResult(); } public L execute() { if(thread != null) throw new IllegalStateException(); run(); return getResult(); } abstract protected L getResult(); }
package com.wordpress.antimatroid; import java.util.List; import java.util.RandomAccess; public class MergeSortThreadedListOperation <T extends Comparable<T>, L extends List<T> & RandomAccess> extends IThreadedListOperation<T, L> { private final L a; private L b; private final int threshold; public MergeSortThreadedListOperation(L a) { this(a, 1024); } public MergeSortThreadedListOperation(L a, int threshold) { if(a == null) throw new IllegalArgumentException("a must be nonnull."); if(threshold <= 0) throw new IllegalArgumentException("threshold must be greater than zero."); this.a = a; this.threshold = threshold; } @Override public void run() { if(a.size() <= 1) { b = a; return; } if(a.size() <= threshold) { MergeSortListOperation<T, L> mergeSort = new MergeSortListOperation<T, L>(a); b = mergeSort.execute(); return; } CopyListOperation<T, L> leftPartition = new CopyListOperation<T, L>(a, (a.size() / 2) + a.size() % 2, 0); MergeSortThreadedListOperation<T, L> leftSort = new MergeSortThreadedListOperation<T, L>(leftPartition.execute()); CopyListOperation<T, L> rightPartition = new CopyListOperation<T, L>(a, (a.size() / 2), (a.size() / 2) + a.size() % 2); MergeSortThreadedListOperation<T, L> rightSort = new MergeSortThreadedListOperation<T, L>(rightPartition.execute()); rightSort.executeBegin(); MergeListOperation<T, L> merge = new MergeListOperation<T, L>(leftSort.execute(), rightSort.executeEnd()); b = merge.execute(); } @Override protected L getResult() { return b; } }
Run Time Analysis
Noting that the time complexity for the parallel approach is , a simple linear least squares regression of the empirical run times in normal space will yield a satisfactory approximation to the theoretical time complexity.
The trial methodology used in the sequential run time analysis is used once again to produce the following plot. Note that it begins at 2048 instead of 1. This was done so that only the parallel implementation was considered and not the sequential implementation when the input size is .
As presented in the plot, the regressed linear model in logarithmic space yields a satisfactory theoretical curve whose relative error to the empirical curve diminishes to zero as the input size increases.
Threshold Selection
As a thought experiment, it makes sense that as the threshold approaches infinity, that there is no difference between the sequential implementation and parallel one. Likewise, as the threshold approaches one, then the number of threads being created becomes exceedingly large and as a result, places a higher cost on parallelizing the operation. Someplace in the middle ought to be an optimal threshold that yields better run time performance compared to the sequential implementation and a pure parallel implementation. So a fixed input size should produce a convex curve as a function of the threshold and hence have a global minimum.
Conducting a similar set of trials as the ones conducted under the analysis of the sequential run time give a fully parallel and sequential curve which to evaluate where the optimal threshold resides. As the plot depicts, as the threshold approaches one, there is an increase in the processing taking the form of a convex curve. As the threshold exceeds the input size, then the sequential approach dominates. By conducting a Paired TTest against the means of the two curves at each input size, 1024 was determined to be the optimal threshold based on the hardware used to conduct the trials. As the input size grows, it is evident that for thresholds less than 1024, the sequential approach requires less time and afterwards, the parallel approach is favorable.
Conclusion
In comparing the sequential and parallel implementations it was observed that the specified parallel implementation produced as much as a 2.65 factor improvement over the specified sequential implementation for megabyte sized lists.
Larger sized lists exhibited a declining improvement factor. It is presumed that as the input size grows that the amount of memory being created is causing excessive paging and as a result increasing the total run time and consequently reducing the improvement factor. To get around this limitation, the algorithm would need to utilize an inplace approach and appropriate synchronization constructs put into place to guarantee thread safety.
From a theoretical point of view, the improvement factor is the ratio of the run time of the sequential implementation to the parallel implementation. Using the time complexities presented, . Taking the limit as the input size grows to infinity gives . So if there is any upper bound to the improvement factor it should be purely technical.
Footnotes
[1] This design decision is discussed in §4.7 of the Java Language Specification (3rd Edition) on reifiable types.
[2] The only two java.util classes providing this guarantee are ArrayList
and Vector
. Both of which implement the interface RandomAccess
which is intended indicate that the class provides the (near) constant reading and writing of elements.
[3] The lefttoright order of operations is specified by §15.7.4 of the Java Language Specification (3rd Edition). Also worth noting the specification recommends against the practice of relying on this convention however in §15.7:
… It is recommended that code not rely crucially on this specification. Code is usually clearer when each expression contains at most one side effect, as its outermost operation, and when code does not depend on exactly which exception arises as a consequence of the lefttoright evaluation of expressions.