# Antimatroid, The

thoughts on computer science, electronics, mathematics

## GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA

C, CUDA, and Python source code available on GitHub

### Introduction

Gaussian Mixture Models [1, 435-439] offer a simple way to capture complex densities by employing a linear combination of $K$ multivariate normal distributions, each with their own mean, covariance, and mixture coefficient, $\pi_{k}$, s.t. $\sum_{k} \pi_{k} = 1$.

$\displaystyle p( x ) = \sum_{k = 1}^{K} \pi_{k} p(x \lvert \mu_k, \Sigma_k)$

Of practical interest is the learning of the number of components and the values of the parameters. Evaluation criteria, such as Akaike and Bayesian, can be used to identify the number of components, or non-parametric models like Dirichlet processes can be used to avoid the matter all together. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU for varying numbers of components, points, and point dimensions.

### Multivariate Normal Distribution

The multivariate normal distribution With mean, $\mu \in \mathbb{R}^d, d \in \mathbb{N}_1$, and symmetric, positive definite covariance, $\Sigma \in \mathbb{R}^{d \times d}$, is given by:

$\displaystyle p( x \lvert \mu, \Sigma ) = \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma \rvert }} \exp{\left( - (x - \mu)^{T} \Sigma^{-} (x - \mu) / 2 \right)}$

From a computational perspective, we will be interested in evaluating the density for $N$ values. Thus, a naive implementation would be bounded by $\mathcal{O}\left(N d^4\right)$ due to the matrix determinate in the normalization term. We can improve upon this by computing the Cholesky factorization, $\Sigma = L L^T$, where $L$ is a lower triangular matrix [6, 157-158]. The factorization requires $\mathcal{O} \left ( d^3 \right )$ time and computing the determinate becomes $\mathcal{O} \left (d \right)$ by taking advantage of the fact that $\det\left(L L^T\right) = \det(L)^2 = \prod_i L_{i,i}^2$. Further, we can precompute the factorization and normalization factor for a given parameterization which leaves us with complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires one perform two vector matrix operations and find the inverse of the covariance matrix with worst case behavior $\mathcal{O} \left (d^3\right)$. Leveraging the Cholesky factorization, we’ll end up solving a series of triangular systems by forward and backward substitution in $\mathcal{O} \left (d^2\right)$ and completing an inner product in $\mathcal{O} \left (d\right)$ as given by $L z = x - \mu$, $L^T z = y$, and $(x-\mu)^T y$. Thus, our pre-initialization time is $\mathcal{O} \left (d^3 \right)$ and density determination given by $\mathcal{O} \left (N d^2 \right)$. Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic, $\Sigma = \sigma I$, and non-isotropic, $\Sigma_{k,k} = \sigma_k$, configurations. For robustness, we’ll stick with the full covariance.

$\displaystyle \log p( x \lvert \mu, \Sigma ) = - \frac{1}{2} \left( d \log 2\pi + \log \lvert \Sigma \rvert \right ) - \frac{1}{2} (x - \mu)^{T} \Sigma^{-} (x - \mu)$

To avoid numerical issues such as overflow and underflow, we’re going to consider $\log p(x \lvert \mu, \Sigma)$ throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criteria satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a subspace of $\mathbb{R}^d$.

### Expectation Maximization

From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by $k^{*} = \arg\max_k \, \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k)$. Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updating those parameters by maximizing the log-likelihood of the data:

$\displaystyle \mathcal{L} \left( \mathcal{D} \lvert \mu, \Sigma \right) = \sum_{n = 1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log{ \left [ \sum_{k = 1}^{K} \pi_{k} p \left( x_n \lvert \mu_k, \Sigma_k \right ) \right ] }$

Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:

$\displaystyle \log p( x ) = a + \log \left[ \sum_{k = 1}^{K} \exp{ \left( \log \pi_{k} + \log p(x \lvert \mu_k, \Sigma_k) - a \right ) } \right ]$

Where the $a$ term is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of $\mu_k$ and $\Sigma_k$, and account for it explicitly in the update of $\pi_k$):

$\displaystyle \gamma_{k, n} = \frac{ p \left ( x_n \lvert \mu_k, \Sigma_k \right ) }{ p(x) } \qquad \Gamma_k = \sum_{n=1}^{N} \gamma_{k, n}$

$\displaystyle \log \gamma_{k, n} = \log p \left ( x_n \lvert \mu_k, \Sigma_k \right ) - \log p(x) \qquad \log \Gamma_k = a + \log \left [ \sum_{n=1}^{N} \exp{ \left( \log \gamma_{k, n} - a \right )} \right ]$

The new parameters are resolved within the maximization step:

$\displaystyle \pi_{k}^{(t+1)} = \frac{ \pi_{k}^{(t)} \Gamma_k }{ \sum_{i=1}^{K} \pi_{i}^{(t)} \Gamma_i } \qquad \log \pi_{k}^{(t+1)} = \log \pi_{k}^{(t)} + \log \Gamma_k - a - \log \left [ \sum_{i=1}^{K} \exp{ \left( \log \pi_{i}^{(t)} + \log \Gamma_i - a \right )} \right ]$

$\displaystyle \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \gamma_{n, k} }{ \Gamma_k } \qquad \mu_k^{(t+1)} = \frac{ \sum_{n=1}^{N} x_n \exp{ \log \gamma_{n, k} } }{ \exp{ \log \Gamma_k } }$

$\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \gamma_{n, k} }{ \Gamma_k }$

$\displaystyle \Sigma_k^{(t+1)} = \frac{ \sum_{n=1}^{N} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T \exp \log \gamma_{n, k} }{ \exp \log \Gamma_k }$

The algorithm continues back and forth between expectation and maximization stages until the change in log likelihood is less than some epsilon, or a maximum number of user specified iterations has elapsed.

### Implementations

Sequential Per iteration complexity given by $\mathcal{O}\left(2 K N d^2 + K N d + 2K + N + K d^3\right)$. We expect $d \le K < N$ because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by $\mathcal{O}\left(2 K N d^2 \right)$.

Parallel There are two natural data parallelisms that appear in the algorithm. The calculation of the $\mathcal{L}$ and $\gamma$ across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. Resulting complexity is given by $\mathcal{O}\left(\frac{2}{P} d^2 K N \right)$ for work coordinated across $P$ processors.

Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of $\log p\left(x | \mu_k, \Sigma_k \right)$ we compute each point in parallel by forming a grid of one dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the loglikelihood and $\log \gamma_{n,k}$ is done by computing and storing $\log p(x)$, then updating the storage for $\log p\left(x|\mu_k,\Sigma_k\right)$, and then performing a parallel reduction over $\log p(x)$ to produce the loglikelihood. Parallel reductions are a core tasks are implemented by first standardizing the input array of points to an supremum power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the loglikelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute $\log \Gamma_k$, $\log \gamma_{n,k}$ is copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component to yield $\log \Gamma_k$. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances is computed with a single kernel invocation parallel in the number of components.

The main design consideration was whether or not use streams. For larger numbers of components, this will result in improved runtime performance, however, it comes at the cost of increased memory usage which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increase in memory was favorable to using less memory and executing each component sequentially.

To optimize the runtime of the implementation nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, this allowed the implementation to execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events for parallelization of the component probability densities and parameter updates in the maximization step. In doing so, this allowed for a $K$ fold reduction since the components calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory minimizing the number of global memory writes instead of performing iterated reductions against sequential addressing that preformed global memory reads and writes for each point. The final optimization step was to used pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.

### Evaluation

To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use The Box-Muller transform, Zigguart method, or Ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can by done by sampling a standard normal vector $\eta \sim \mathcal{N}(0 ,I_d)$ and computing $\mu + \Sigma^{1/2} \eta$ where $\Sigma^{1/2}$ can be computed by Eigendecomposition, $\Sigma^{1/2} = Q \Delta^{1/2} Q^{-}$, or Cholesky factorization, $\Sigma = L L^T, \Sigma^{1/2} = L$. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient and then sample the underlying $\mathcal{N}(\mu_k, \Sigma_k)$ distribution, and perform this process for the desired number of points.

The matter of generating GMMs it more interesting. Here we draw $\pi_i = X_i / \sum_{j} X_j$ for $X_i \sim \mathcal{U}(0, 1)$, alternatively, one could draw $\pi \sim \text{Dir}(\alpha)$. Means are drawn by $\mu \sim \mathcal{N}(0, a I_d)$ with $a > 1$ so that means are relatively spread out in $\mathbb{R}^{d}$. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution, $\Sigma \sim W(I_d, d, n)$ for $n > d - 1$, comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of $n$ $x_i \sim \mathcal{N}(0, I_d)$ vectors. Based on a $\mathcal{O}\left(d^2\right)$ method by [8], [9] gives an equally efficient method for sampling $\Sigma^{1/2} = L$ by letting $L_{i,i} \sim \chi^2(n - i)$ and $L_{i,j} \sim \mathcal{N}(0, 1)$ for $0 \le i < d$ and $0 \le j < i$.

To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured by varying each of the $N$, $K$, and $d$ parameters while holding the other two fixed. From an end user perspective wall clock time is preferable to the time the operating system actually devoted to the problem since wall clock time is more valuable. There will be variability in the results since each instance requires a different number of iterations for the log likelihood to converge. Tests were conducted on a Xeon 1245 v5 3.5 Ghz system with 32GB of memory and an NVIDIA GTX 1060 6GB graphics card with 1280 cores.

Since the parameter space is relatively large Figures 2-5 look at varying one parameter will fixing the others to demonstrate the relative merits of each approach. When the number of points dominates the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the num of components is high giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable delivering superior runtime performance without sacrificing quality.

It is important to note that the results obtained from the CUDA solution may differ to those the sequential and parallel approaches. This is due to nondeterministic round off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably, the presence of fused multiple add on NVIDIA GPUs which are more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:

### Conclusion

This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instance, and focus on providing higher level language bindings so that it can be better utilized in traditional data science toolchains.

### References

1. Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
2. Collange, S., Defour, D., Graillat, S., and Lakymhuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83-97.
3. Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the eme algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1-38.
4. Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, (2016) C1-C32.
5. Harris, M. Optimizing cuda. SC07: High Performance Computing with CUDA (2007).
6. Kincaid, D., and Cheney, W. Numerical analysis: mathematics of scientific computing. 3 ed. Brooks/Cole, 2002.
7. Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257-260.
8. Odell, P., and Feiveson, A. A Numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199-203.
9. Sawyer, S. Wishart distributions and inverse-wishart sampling. URL: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf (2007).
10. Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and ieee 754 compliance for nvidia gpus. rn(A + B) 21., 1 (2011), 18749-19424.

Written by lewellen

2017-04-22 at 11:36 am

## A Greedy Approximation Algorithm for the Linear Assignment Problem

Starting today, I will be posting some of the related source code for articles on GitHub.

### Introduction

The Linear Assignment Problem (LAP) is concerned with uniquely matching an equal number of workers to tasks, $n$, such that the overall cost of the pairings is minimized. A polynomial time algorithm was developed in the late fifties by [6], and further refined by [9], called the Hungarian method. Named so after the work of Hungarian mathematicians König and Egerváry whose theorems in the 1930s form the basis for the method. While the Hungarian Method can solve LAP instances in $\mathcal{O}\left(n^3\right)$ time, we wish to find faster algorithms even if it means sacrificing optimality in the process. Here we examine a greedy $\alpha$-approximation algorithm with $\mathcal{O}\left(n^2 \log n \right)$ runtime in terms of its approximation factor and compare it empirically to the Hungarian method.

### Linear Assignment Problem

\displaystyle \begin{aligned} C_n = \min & \sum_{i=1}^{n} \sum_{j=1}^{n} M_{i,j} x_{i,j} \\ s.t. & \sum_{i=1}^{n} x_{i,j} = 1, \quad j = 1, \ldots, n \\ & \sum_{j=1}^{n} x_{i,j} = 1, \quad i = 1, \dots, n \label{eqn:lap} \end{aligned}

The above linear program has cost, $M \in \mathbb{Z}_{+}^{n \times n}$, and assignment, $x \in \lbrace 0,1 \rbrace^{n \times n}$, matrices that specify the terms of the LAP. This is equivalent to finding a perfect matching in a weighted bipartite graph. A minimal cost may have several possible assignments, but we are only interested in finding just one. It is assumed that no one worker can do all jobs more efficiently by themselves than the distributing work across all workers. Likewise, if the costs are thought of as durations, then the minimum cost is the minimum sequential rather than parallel time taken to complete the tasks.

From a practical point of view, we may relax the integral constraint on $M$ and allow all positive real-valued costs. For instances where there are more jobs than workers, and vice versa, dummy entries valued greater than the existing maximum may be added. Minimizing the cost is the default objective, but the maximum cost can be found by finding the optimal assignment for $M^{\prime}_{i,j} = M_{max} - M_{i,j}$, then finding the cost relative to $M$.

### Algorithms

Brute Force Rather than using the mathematical programming or graph theoretic representation of the problem, we can instead view the problem as finding the assignment that minimizes the cost out of all possible assignments:

$\displaystyle \pi^{*} = \underset{\pi \in \Pi_n}{\arg\min} \sum_{i=1}^{n} M_{i, \pi_i}$

There are $n!$ such assignments which can be produced using an iterative version of Heap’s algorithm [5] in $\mathcal{O}\left(n!\right)$ time assuming one does differential scoring (opposed to calculating the score for each permutation which would result in an $\mathcal{O}\left(n^2 (n-1)!\right)$ algorithm.)

Random The random algorithm selects a permutation $\pi \in \Pi_n$ uniformly from the set of all possible assignment permutations in $\mathcal{O}\left(n\right)$ time using the Fisher-Yates shuffle [4]. This obviously does not produce an optimal or near-optimal solution, but serves as a straw man to compare other results.

Greedy The greedy heuristic continues to cover the row and column of the smallest uncovered entry in the cost matrix until all entries are covered. The resulting set of entries then constitutes the assignment of workers to jobs. An inefficient $\mathcal{O}\left(n^3\right)$ algorithm can be used to find the smallest entry every iteration, or a more efficient result of $\mathcal{O}\left(n^2 \log n\right)$ can be obtained through the use of a sorted, array indexed hybrid mesh and queue. Let $\texttt{QNode}$ represent a tuple consisting of row, column, and value; the previous entry in the matrix $\le$ this value, and the next entry in this matrix $\ge$ this value; and the $\texttt{QNode}s$ (left, above, right, below) that are adjacent to this node.

Algorithm 1 A greedy algorithm for the LAP.

• $\textbf{procedure } \textsc{Greedy}(M)$
• $A[i] \gets \bot \text{ for } i = 0 \ldots n - 1$
• $Q[i] \gets \texttt{QNode} \text{ for } i = 0 \ldots n^2 - 1$
• $\textsc{LinkMesh}(Q)$ // Adjacent node left, above, right, below properties
• $\textsc{Sort}(Q)$ // Sort in ascending order by node value
• $\textsc{LinkQueue}(Q)$ // Adjacent node previous and next properties
• $\qquad Q_{min} \gets Q[0]$
• $\textbf{while } Q_{min} \neq nil \textbf{ do}$
• $A[ Q_{min} \rightarrow row ] \gets Q_{min} \rightarrow col$
• $Q_{min} \gets \textsc{DeleteNode}(Q, Q_{min})$ // Deletes row and col of $Q_{min}$
• $\textbf{end while}$
• $\qquad \textbf{return } A$

Allocating and linking for assignment is $\mathcal{O}\left(n\right)$; mesh $\mathcal{O}\left(n^2\right)$; queue $\mathcal{O}\left(2n^2\log n + n^2\right)$. Therefore, initialization requires $\mathcal{O}\left(n^2 \log n\right)$ time. The body of the loop requires a constant time assignment of worker to job, and $\mathcal{O}\left(2k - 1\right)$ time to remove the row and column from a $k \times k$ matrix using a modified depth first search. Thus, the loop itself accounts for $\mathcal{O}\left(n^2\right)$ time. The resulting time complexity is therefore $\mathcal{O}\left(n^2 \log n\right) \square$.

$\displaystyle \begin{pmatrix} 62 & 31 & 79 & \fbox{6} & 21 & 37 \\ 45 & 27 & 23 & 66 & \fbox{9} & 17 \\ 83 & 59 & 25 & 38 & 63 & \fbox{25} \\ \fbox{1} & 37 & 53 & 100 & 80 & 51 \\ 69 & \fbox{72} & 74 & 32 & 82 & 31 \\ 34 & 95 & \fbox{61} & 64 & 100 & 82 \\ \end{pmatrix} \quad \begin{pmatrix} 62 & 31 & 79 & \fbox{6} & 21 & 37 \\ 45 & 27 & 23 & 66 & \fbox{9} & 17 \\ 83 & 59 & \fbox{25} & 38 & 63 & 25 \\ \fbox{1} & 37 & 53 & 100 & 80 & 51 \\ 69 & 72 & 74 & 32 & 82 & \fbox{31} \\ 34 & \fbox{95} & 61 & 64 & 100 & 82 \\ \end{pmatrix}$

Breaking ties for the minimum uncovered value can result in different costs. This drawback is shown in the above example were choosing $25$ at $(3,6)$ yields a minimum cost of $174$, where as the one at $(3, 3)$ gives a minimum cost of $167$. The next progression in the design of the greedy algorithm would be to try all minimum positions and keep the top $k$ performing paths.

Hungarian The general idea behind the Kuhn-Munkres algorithm is that if we are given an initial assignment, we can make further assignments and potentially reassign workers until all workers have been tasked with a job. The high-level sketch of the algorithm starts with an initial assignment. While we have jobs that are unassigned, we look for qualified workers, ie, the zero entries. If a worker is already assigned to a job, but is also qualified for another, then we prime the alternative and continue to the next qualified worker, but if that is the only job the worker is qualified for, then we’d like to reassign any other worker already tasked to that job. This leads to a natural ripple effect represented by an alternating path of starred and primed entries. In Munkres’ paper [9] “starred” zero’s represent assignments of workers to jobs, and “primed” zero’s are alternative assignments. By flipping the bits of the path, we reassign workers to their alternative tasks while ensuring the assignment continues to be minimal by construction. After assigning as many workers as we have to, we then deduct the lowest cost to create a new qualified worker. Thus, every iteration we are guaranteed to make positive progress towards our goal of finding an optimal assignment. This scheme results in the worst case $\mathcal{O}\left(n^3\right)$ time to complete.

Algorithm 2 The Hungarian method for the LAP.

• $\textbf{procedure } \textsc{HungarianMethod}(M)$
• $M_{i,j} \gets M_{i,j} - \min_j M_{i,j} \text{ for } i = 0 \ldots n - 1$
• $M_{i,j} \gets M_{i,j} - \min_i M_{i,j} \text{ for } j = 0 \ldots n - 1$
• Star the first uncovered zero in row $i$, cover the corresponding column $j$ for $i = 0 \ldots n - 1$
• $\textbf{while }$ All columns not covered
• $\textbf{while }$ Uncovered zeros
• Prime the current uncovered zero
• $\textbf{if }$ There’s a starred zero in this row
• Uncover the starred zero’s column and cover the row
• $\textbf{else }$
• Find an alternating augmented path from the primed zero
• Unstar the starred zeros on the path and star the primed zeros on the path
• Remove all the prime markings and cover all stared zeros
• $\textbf{break}$
• $\textbf{end if}$
• $\textbf{end while}$
• $\textbf{if }$ Found path
• $\textbf{continue}$
• $\textbf{end if}$
• $M^* = \min M_{i,j}$ over all uncovered $i, j$
• $M_{i,j} = M_{i,j} - M^*$ for all uncovered columns $j$
• $M_{i,j} = M_{i,j} + M^*$ for all covered rows $i$
• $\textbf{end while }$
• $\textbf{return}$ Starred zeros // These are all the assignments
• $\textbf{end procedure}$

To further illustrate the algorithm, consider the following example where starred entries are denoted by red, and primed entries by green:

### Analysis

The prevailing convention in the literature is to look at the approximation factor, $\alpha$, to determine how close the results of an approximation algorithm are to optimal [10]. Here this ratio is the expected minimum cost assignment of the algorithm under test to the same quantity given by the expected minimum assignment cost. Let $M_{i,j} \sim \text{Exp}(1)$ be an $n \times n$ a standard exponential random cost matrix. We resort to the exponential distribution for its ease of analyis and prominence in related literature. Cf. the works of [7], [8] for analysis based on $M_{i,j} \sim \mathcal{U}(0,1)$.

Exponential Distribution Properties Let $X \sim \text{Exp}(\lambda)$ have cumulative distribution function $F_X(x) = 1 - \exp{\left(-\lambda x\right)}$ and expectation $\mathbb{E}(X) = \lambda^{-}$. The distribution demonstrates the memoryless property for expectations $\mathbb{E}(X \lvert X > a) = \mathbb{E}(X) + a$. Define the order statistic $X_{1:n} = \min \lbrace X_{1}, \ldots, X_{n} \rbrace$ to be the minimum of $n$ draws from $\text{Exp}(\lambda)$. $X_{1:n} \sim \text{Exp}(n \lambda)$ [2] with expectation $\mathbb{E}(X_{1:n}) = \left(n \lambda\right)^-$. If $Y_n = \sum_{i = 1}^{n} X_i$ then $Y_n \sim \text{Gamma}(n, \lambda)$ with expectation $\mathbb{E}(Y_n) = n \lambda^{-}$.

Expected Minimum Cost The expected minimum assignment cost for $M$ is given by [1]:

$\displaystyle \mathbb{E}(C_n) = \sum_{k = 1}^{n} \frac{1}{k^2} = H_{n}^{(2)}$

Which is the generalized harmonic number of order two and converges to $\zeta(2) = \pi^2/6$. For the generalized harmonic numbers, $H_{n}^{(k)}$, $\lim_{k\to\infty} H_{n}^{(k)} = \zeta(k)$ for $k > 1$.

Greedy The minimum value of an $n \times n$ matrix is given by the order statistic $M_{1:n^2}$ with expectation $\mathbb{E}(M_{1:n^2}) = n^{-2}$. The expected value of the minimum cost assignment is not just $\sum_{i=0}^{n-1} (n-i)^{-2}$ because the expectation doesn’t take into account the previous iteration’s minimum value. To accomplish this we make use of the memoryless property of the exponential distribution to observe that the expected difference in minimums between iterations is the expected minimum value given by $M_{i:k^2}$. If we add up all these differences we get the expected minimum value of the k’th iteration; summing all these expectations then yields the expected minimum cost assignment:

$\displaystyle \mathbb{E}(C_n) = \sum_{i=0}^{n-1} \sum_{j=0}^{i} \frac{1}{(n - j)^2} = \sum_{j=0}^{n-1} \frac{(n-j)}{(n-j)^2} = \sum_{j=0}^{n-1} \frac{1}{n-j} = H_n$

This is the harmonic number of order one which does not converge. The resulting approximation factor is:

$\displaystyle \alpha_n = \frac{H_n}{H_n^{(2)}}$

Random The random algorithm will simply select an assignment permutation, so we are just adding up $n$ $\text{Exp}(1)$ distributed random variables leading to an expected cost of:

$\displaystyle \mathbb{E}(C_n) = \sum_{i=1}^n \mathbb{E}(M_{i, \pi_i}) = n$

And approximation factor:

$\displaystyle \alpha_n = \frac{n}{H_n^{(2)}}$

From this analysis one concludes that the greedy algorithm has an unbounded approximation factor that grows significantly slower than that of randomly selecting assignments.

### Evaluation

To illustrate the preceding results, Figure 1 shows the approximation factor for the greedy algorithm implementations against the derived approximation factor. The simulated results are based on 120 $n \times n$ standard exponentially distributed matrices for $1 \le n \le 1000$. Using the same conventions for the approximation factor, Figure 2 illustrates the runtime characteristics of the algorithms after rejecting outliers due to system fluctuations. Results obtained from source code compiled with -O3 flags and ran on a Xeon E3-1245 v5 3.5 Ghz system with 32 GBs of 2133Mhz DDR4 RAM. The algorithms coincide with the theoretical time complexities as shown in Table 2.

Solver MSE
GREEDY-EFFICIENT 0.002139
GREEDY-NAIVE 0.014161
HUNGARIAN 0.232998
Table 1: Mean square error of fitted model to mean runtime for each solver. Models given by the corresponding time complexity. Fit by Levenberg-Marquardt.

### Summary

Brute Random Greedy Hungarian
Complexity $\mathcal{O}\left(n!\right)$ $\mathcal{O}\left(n\right)$ $\mathcal{O}\left(n^2 \log n\right)$ $\mathcal{O}\left(n^3\right)$
$\alpha_n$ 1 $n / H_n^{(2)}$ $H_n / H_n^{(2)}$ 1
Table 2: Merits of each approach.

Exact solutions can be delivered by the brute method when a handful of workers are being considered, and the Hungarian method should be considered for all other instances. Approximate solutions can be provided by the greedy algorithm with logarithmic degeneracy while providing a linear factor improvement over the Hungarian method. For inputs greater than those considered, the parallel Auction algorithm [3] is a suitable alternative and the subject of future work.

### References

1. Aldous, D. J. The $\zeta(2)$ limit in the random assignment problem. Random Structures & Algorithms 18, 4 (2001), 381-418.
2. Balakrishnan, N., and Rao, C. Handbook of statistics 16: Order statistics-theory and methods, 2000.
3. Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of operation research 4, 1 (1988), 105-123.
4. Durtenfeld, R. Algorithm 235: random permutation. Communications of the ACM 7, 7 (1964), 420.
5. Heap, B. Permutations by interchanges. The Computer Journal 6, 3 (1963), 293-298.
6. Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly 2, 1-2 (1955), 83097.
7. Kurtzberg, J. M. On approximation methods for the assignment problem. Journal of the ACM (JACM) 9, 4 (1962), 419-439.
8. Steele, M. J. Probability and statistics in the service of computer science: illustrations using the assignment problem. Communications in Statistics-Theory and Methods 19, 11 (1990), 4315-4329.
9. Munkres, J. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 32-38.
10. Williamson, D. P., and Shmoys, D. B. The design of approximation algorithms. Cambridge university press, 2011.

Written by lewellen

2017-03-21 at 11:12 am

## Using GPUs to solve Spatial FitzHugh-Nagumo Equation Numerically

### Introduction

#### Modeling Action Potentials

In neurophysiology, we wish to model how electrical impulses travel along the complex structures of neurons. In particular, along the axon since it is the principle outbound channel of the neuron. Along the axon are voltage gated ion channels in which concentrations of potassium and sodium are exchanged. During depolarization, a fast influx of sodium causes the voltage to gradually increase. During repolarization, a slow outflux of potassium causes the voltage to decrease gradually. Because the potassium gates are slow to close, there is a negative voltage dip and recovery phase afterwards called hyperpolarization.

Attempts to provide a continuous time model of these phases began with work in the 1950s by Hodgkin and Huxley [HH52]. The duo formulated a four-dimensional nonlinear system of ordinary differential equations based on an electrical circuit model of giant squid axons. This landmark contribution was later simplified in the 1960s by FitzHugh [Fit61]. Here FitzHugh casted the problem in terms of Van der Pol oscillators. This reformulation allowed him to explain the qualitative behavior of the system in terms of a two-dimensional excitation/relaxation state-space model. The following year Nagumo [NAY62] extended these equations spatially to model the desired action potential propagation along an axon, and demonstrated its behavior on analog computers. This spatial model will be the focus of this work, and a more comprehensive account of these developments can be found in [Kee02].

#### FitzHugh-Nagumo Equation

The Spatial FitzHugh-Nagumo equation is a two-dimensional nonlinear reaction-diffusion system:

 $\displaystyle \frac{\partial }{\partial t} v = \frac{\partial^2 }{\partial^2 x} v + v \cdot (1 - v) \cdot (v - a) - w \qquad \frac{\partial }{\partial t} w = \epsilon \left( v - \gamma w \right )$ $\displaystyle v(x, 0) = v_0 \qquad w(x, 0) = 0 \qquad \frac{\partial}{\partial x} v(0, t) = 0 \qquad \frac{\partial}{\partial x} v(L, t) = 0$ (1)

Here $v(x, t)$ represents the action potential (voltage) along the axon, and $w(x, t)$ the recovery of the system. Constant $0 < a 0$ is absent from the literature, and $\epsilon > 0$ is inversely proportional to recovery time.

#### Mathematical Analysis

The system does not admit a general analytic solution. Mathematical analysis of the system is concerned with the existence and stability of traveling waves with [AK15] providing a thorough account of these aspects. Other analyses are concerned with the state-space model and the relationship of parameters with observed physiological behavior. In particular: self-excitatory, impulse trains, single traveling wavefronts and impulses, doubly traveling divergent wave impulses, and non-excitatory behavior that returns the system to a resting state. Since the FitzHugh-Nagumo equations are well understood, more recent literature is focused on higher dimensional and stochastic variants of the system which [Tuc13] discusses in detail.

#### Numerical Analysis

A survey of the literature revealed several numerical approaches to solving the FitzHugh-Nagumo equations consisting of a Finite Element method with Backward Differentiation Formulae [Ott10], and the Method of Lines [Kee02] approach. In this work, three approaches based on the Finite Difference method are investigated: an explicit scheme, an adaptive explicit scheme, and an implicit scheme; along with their associated sequential CPU bound and parallel GPU bound algorithms.

### Explicit Finite Difference Scheme

To study the basic properties of the FitzHugh-Nagumo equations an explicit scheme was devised using forward and central differences for the temporal and spatial derivatives respectively.

 $\displaystyle v^{j + 1}_{i} = \frac{ \Delta t }{ \Delta x^2 } \left ( v^{j}_{i+1} - 2 v^{j}_{i} + v^{j}_{i-1} \right ) + v^{j}_{i} + \Delta t \left ( v^{j}_{i} \cdot \left( 1 - v^{j}_{i} \right) \cdot \left( v^{j}_{i} - a \right ) - w^{j}_{i} \right )$ $\displaystyle w^{j+1}_{i} = \Delta t \epsilon \left( v^{j}_{i} - \gamma w^{j}_{i} \right) + w^{j}_{i}$ $\displaystyle v^{0}_{i} = v_0( i \Delta x, 0) \qquad w^{0}_{i} = 0 \qquad v^{j}_{0} = v^{j}_{1} \qquad v^{j}_{N-2} = v^{j}_{N-1}$ (2)

Truncation errors are linear in time, $\mathcal{O}(\Delta t)$, and quadratic, $\mathcal{O}(\Delta x^{2})$, in space. For stability, $2 \Delta t \le \Delta x^2$, implying the use of inconveniently small time steps, however in practice it seems rare to find any mention of $\Delta x < 0.5$ in the literature. Each time step can be computed sequentially or in parallel in linear time, $\mathcal{O}(n)$. However, there is a significant constant factor improvement delivered by parallel GPU implementations since modern GPUs allow millions of nodes to be updated in parallel giving near constant runtime performance. Further parallelization is possible by trivially distributing contiguous node sets to a combination of multiple machines and multiple GPUs.

#### Experiments

 Figure 1: Depiction of excitation and relaxation with a constant input on the tail end of the axon. Waves travel right-to-left. State of system shown for $t = 306.19$. Figure 2: Impulse at $t = 0$ grows axially before its peak collapses at $t = 37.85$ giving rise to two stable impulses that travel in opposite directions before vanishing at $t = 242.98$.

The explicit scheme was used to investigate the traveling wave and divergent wave behaviors of the FitzHugh-Nagumo equations. Fig. (1) demonstrate the application of a constant impulse, $v(t, L) = 1$, at the end of an unexcited axon, $v_0(x) = 0$, giving rise to oscillatory behavior. Fig. (2) shows a Gaussian impulse applied initially to the center of the axon, $v_0(x) = \exp{- \frac{(x - 100)^2}{10}}$. As the system evolves, the impulse collapses giving rise to two separate impulses that travel in opposite directions before dissipating at the boundaries returning the axon to a completely unexcited state. Both test cases are qualitatively consistent with the literature and share the following parameters:

 $\displaystyle a = 0.01 \qquad \gamma = 2 \qquad \epsilon = 0.01 \qquad N = 401 \qquad \Delta x = 0.5 \qquad 2 \Delta t < \Delta x^2$ (3)

#### Error Analysis

 Figure 3: Varying values of $\Delta x$ w.r.t $\Delta x = 0.000045$ at $t = 0.91$. $\Delta t$ fixed to 0.0098. Figure 4: Varying values of $\Delta t$ w.r.t $\Delta t = 0.00023$ at different values of $t$. $\Delta x$ fixed to 0.5.

Two experiments were ran to verify the suggested truncation errors. The numerical solution given by a sufficiently small step size serves as an analytic solution to $v(x, t)$, which is then used to evaluate how larger steps sizes deviate as measured by the root mean squared error. Fig. (3) looks at varying $\Delta x$ and shows that as the step size as halved, the resulting RMSE is quartered consistent with the expect quadratic truncation term of the scheme. Similarly, Fig. (4) looks at varying $\Delta t$ and shows that independent of $t$, halving the step size results in an equally reduced RMSE consistent with the expected linear truncation term of the scheme.

#### Runtime Performance

 Figure 5: Wall time for memory allocation, memory transfers, and core loop for both CPU and GPU. Figure 6: Wall time for just core loop for both CPU and GPU.

Comparison of sequential CPU and parallel GPU bound algorithms is based on the wall time taken to perform a single iteration to calculate $v^{j+1}$ and $w^{j+1}$. Reported figures are a function of spatial node quantity and the mean run time of 30 iterations. The main bottleneck to performance is transferring buffers between the host and device. Fig. (5) illustrates this effect. For the largest test case considered, $N = 2^{24}$, “GPU w/o tx” delivers 19x faster runtime over “GPU w/ tx” by not transferring intermediate results back to the host. Similarly, it delivers significantly faster runtime over the CPU by a factor of 73x. To better understand this performance boost, Fig. (6) looks at just the core loop execution time. For $N = 2^{24}$, the core loops account for only 3.7% and 14.9% of the execution time on the CPU and GPU respectively with the GPU delivering 18x better performance than the CPU. These percentages increase monotonically, and suggest that memory transfers will eventually be dominated by sufficiently large inputs on GPUs with an abundance of memory.

### Adaptive Explicit Finite Difference Scheme

While investigating the traveling wavefront solution of the system, numerical oscillations were observed as the simulation zeroed in on the steady state solution. To address this issue, an adaptive explicit scheme was devised. A survey of the literature suggested a family of moving grid methods to solve the FitzHugh-Nagumo system based on a Lagrangian formulation [VBSS90] and a method of lines formulation [Zwa11]. Here a heuristic is used to concentrate grid points where the most change takes place.

The formulation of the scheme is identical to the former section with second order spatial derivatives being approximated by Lagrange interpolating polynomials since the technique supports non-uniform grids. A three point, first order truncation error scheme is used. A five point, third order truncation error scheme was considered, but abandoned in favor of the empirically adequate three point scheme.

 $\displaystyle \frac{\partial^2 }{\partial x^2} f(x_1) \approx 2 \sum_{i=0}^2 y_i \prod_{\substack{j = 0 \\ j \neq i}}^2 \frac{1}{(x_i - x_j)} + \frac{f^{(4)}(\xi_x)}{6} \left [ \prod_{\substack{i = 0 \\ i \neq 1}}^{2} (x - x_i) + 2 \sum_{\substack{i=0 \\ i \neq 1}}^{2}(x-x_i) \right ]$ (4)

The first and second spatial derivatives given by the Lagrange interpolating polynomials are used to decide how much detail is needed in the domain in addition to the first temporal derivative given by finite differences. First, a coarse grid is laid down across the entire domain to minimize the expected distance between nodes since the second order derivative has first order truncation error. Next, the magnitude of the first order spatial derivative (second order truncation error) is used to lay down a finer grid when the derivative is greater than a specified threshold. This corresponds to where the waves of the solution are.

Next, the temporal derivative is used to lay down an even finer grid in those areas having an absolute change above a specified threshold. The change in time corresponds to the dynamics of the system, by adding detail in these areas, we can preserve the behavior of the system. Finally, the zeros of the second spatial derivative serve as indicators of inflection points in the solution. These correspond most closely to the location of the traveling wavefronts of the equation. Here, the most detail is provided around a fixed radius of the inflection points since the width of the wavefronts does not depend on parameterization.

Each iteration, the explicit scheme is evaluated on the grid from the previous iteration and those results are then used to perform the grid building scheme. To map the available solution values to the new grid, the points are linearly interpolated if a new grid point falls between two previous points, or mapped directly if there is a stationary grid point between the two iterations. The latter will be the more common case since all grid points are chosen from an underlying uniform grid specified by the user. Linear interpolation will only take place when extra grid points are included in an area.

#### Experiments

Figure 7: Top: Numerical oscillation of a centered Gaussian impulse with $\epsilon = 0$ and all other parameters the same as those given in previous section for $t = 60.11, 90.12, 120.11$. Bottom: Eliminated numerical oscillation based on the adaptive explicit scheme.

Fig. (7) is the motivating example for this scheme and demonstrates how numerical oscillations can be avoided by avoiding calculations in regions with stationary solutions. To demonstrate that the scheme works well for other test cases, the more interesting and dynamic divergent impulse test case is shown in Fig. (8). Here we can see that as time progresses, points are allocated to regions of the grid that are most influenced by the system’s dynamics without sacrificing quality. For the time steps shown, the adaptive scheme used 2-3x fewer nodes than the explicit scheme from the previous section.

Figure 8: Example of adaptive grid on the divergent impulse example for $t \in$ { 15.06, 30.13, 45.05, 60.12, 75.05, 90.11 }. Red lines given by the explicit scheme from the previous section, green dots given by adaptive explicit scheme.

### Implicit Finite Difference Scheme

An implicit Crank-Nicolson scheme is used in this section to solve the FitzHugh-Nagumo equations. To simplify the computations, $w^{j+1}$ is solved explicitly using a second order central difference scheme before solving $v^{j+1}$ leading to the following formulation:

 $\displaystyle \frac{v^{j+1}_{i}}{\Delta t} - \frac{1}{2} \left [ \frac{v^{j+1}_{i+1} - 2 v^{j+1}_{i} + v^{j+1}_{i-1}}{\Delta x^2} + v^{j+1}_{i} \left( 1 - v^{j+1}_{i} \right ) \left( v^{j+1}_{i} - a \right ) \right ]$ $\displaystyle \qquad = \frac{v^{j}_{i}}{\Delta t} + \frac{1}{2} \left [ \frac{v^{j}_{i+1} - 2 v^{j}_{i} + v^{j}_{i-1}}{\Delta x^2} + v^{j}_{i} \left( 1 - v^{j}_{i} \right ) \left( v^{j}_{i} - a \right ) \right ] + \frac{1}{2} \left [ - w^{j}_{i} - w^{j+1}_{i} \right ]$ $\displaystyle w^{j+1}_{i} = 2 \Delta t \epsilon \left( v^{j}_{i} - \gamma w^{j}_{i} \right) + w^{j-1}_{i}$ $\displaystyle v^{0}_{i} = v_0( i \Delta x, 0) \qquad w^{0}_{i} = 0 \qquad v^{j}_{0} = v^{j}_{1} \qquad v^{j}_{N-1} = v^{j}_{N-2}$ (5)

The truncation error for this scheme is $\mathcal{O}\left( \Delta t^2 + \Delta x^2 \right)$ with an improved, albeit still restrictive, requirement over the explicit scheme that $\Delta t \le \Delta x^2$. Based on this formulation, the right-hand side is completely known when $v^{j+1}$ is calculated. This gives a simpler expression to consider:

 $\displaystyle F(v^{j+1}) = LHS(v^{j+1}) - RHS = 0$ (6)

Newton’s method is used to solve the nonlinear function $F$ with an initial guess equal to the previous time step’s values, $\hat{v}^{0} = v^{n}$. To refine the estimate, the following system is solved iteratively until the magnitude of the refinement, $\delta v$, is less than a specified tolerance or machine epsilon.

 $\displaystyle \hat{v}^{n+1} = \hat{v}^{n} - J^{-}(\hat{v}^{n}) F(\hat{v}^{n}) \qquad J(\hat{v}^{n}) \delta v = F(\hat{v}^{n}) \qquad \hat{v}^{n+1} = \hat{v}^{n} - \delta v$ (7)

This formulation gives rise to a tridiagonal Jacobian matrix, $J$, where the first and last row are specified by no-flux boundary conditions, and constant $\alpha$ off-diagonal entries and variable $\beta_i$ diagonal entries given by the partial derivatives of each nodal equation.

$J = \begin{pmatrix} 1 & 1 & 0 & \cdots & & \\ \alpha & \beta_1 & \alpha & 0 & \cdots & \\ 0 & \alpha & \ddots & \alpha & 0 & \cdots \\ \cdots & 0 & \alpha & \ddots & \alpha & 0 \\ & \cdots & 0 & \alpha & \beta_{n-1} & \alpha \\ & & \cdots & 0 & 1 & 1 \\ \end{pmatrix}$
 $\displaystyle \alpha = \frac{\partial}{\partial v_{i \pm 1}} F = - \frac{1}{2} \left [ \frac{1}{\Delta x^2} \right ]$ (8)

 $\beta_i = \frac{\partial}{\partial v_{i}} F = \frac{1}{\Delta t} - \frac{1}{2} \left [ -\frac{2}{\Delta x^2} -3 {v_{i}}^2 + 2(1 + a) v_{i} - a \right ]$ (9)

This tridiagonal system can be solved sequentially in $\mathcal{O}(n)$ time using the Thomas algorithm or in $\mathcal{O}(n \log_2 n)$ time using the Cyclic Reduction algorithm of [Hoc65]. Cyclic Reduction begins with a forward phase in which all odd indexed unknowns are eliminated recursively until a $2 \times 2$ or $3 \times 3$ system remains that can be solved directly. Cyclic Reduction ends with a backward phase that walks up the recursion stack to solve for the previously eliminated odd indexed unknowns. Since the algorithm decouples unknowns at each recursion level of the forward and backwards phases, these reductions can be done in parallel in $\mathcal{O}(\log_2 n)$ time on the GPU assuming $n-$fold parallelism.

Further parallelism can be achieved by solving points explicitly along the domain, then using those results to create implicit subdomains that can be solved using either Thomas or Cyclic Reduction algorithms on a combination of multiple machines and multiple GPUs at the expense of additional communication and coordination.

#### Error Analysis

 Figure 9: Varying values of $\Delta x$ w.r.t $\Delta x = 0.012$ at $t = 0.51$. $\Delta t$ fixed to 0.000071. Figure 10: Varying values of $\Delta t$ w.r.t $\Delta t = 0.00025$ at different values of $t$. $\Delta x$ fixed to 0.52.

Evaluation of the spatial error revealed an unexpected linear behavior as shown in Fig. (9). As the spatial step is halved, the resulting error was expected to become quartered, instead it is halved. No clear explanation was discovered to account for this discrepancy. With respect to time, Fig. (10) shows that both the Thomas and Cyclic Reduction algorithms were quadratic as multiple points in time were evaluated. The Thomas algorithm produced aberrations as the step size increased eventually producing numerical instability, while the Cyclic Reduction algorithm was immune to this issue.

 Figure 11: Stability of implicit solvers. Figure 12: Convergence of implicit solvers.

In terms of stability, the implicit scheme is stable up to $\Delta t \le \Delta x^2$ depending on which tridiagonal solver is used, and for comparison, the explicit scheme is stable up to precisely $2 \Delta t \le \Delta x^2$ as shown in Fig. (11). The Thomas algorithm demonstrates a slightly weaker stability guarantee than the Cyclic Reduction algorithm becoming unstable around $\Delta t / \Delta x^2 \approx 0.9$ and $\Delta t / \Delta x^2 \approx 1.05$ respectively. In terms of convergence, the Thomas algorithm typically takes more iterations than Cyclic Reduction to obtain the same degree of accuracy as shown in Fig. (12). Taking the sequence of adjustments up to machine epsilon $(2^{-52})$, Thomas algorithm gives $\lVert \delta v^{n+1} \rVert < 0.037 \lVert \delta v^{n} \rVert$ suggesting q-linear convergence. Likewise, Cyclic Reduction gives $\lVert \delta v^{n+1} \rVert < 0.17 \lVert {\delta v^{n}}^2 \rVert$ suggesting q-quadratic convergence.

#### Runtime Performance

 Figure 13: Performance comparison of CPU and GPU bound Thomas and Cyclic Reduction algorithms. Figure 14: Performance comparison of Jacobian solvers.

Sequential Thomas and Cyclic Reduction routines perform equally well on CPU as shown in Fig. (13). The parallel Cyclic Reduction method did not demonstrate significant performance gains on the GPU. However, looking at just the time taken to solve the Jacobian each iteration (not including initialization or memory transfers), parallel Cyclic Reduction on the GPU was 4-5x faster than both sequential CPU solvers as shown in Fig. (14).

To explain the poor performance of Cyclic Reduction on the GPU, there are a number of different factors at play. The algorithm is susceptible to warp divergence due to the large number of conditionals that take place. Reliance on global memory access with varying strides contributes to slow performance since shared memory can’t be effectively utilized, and each iteration the adjustments are transferred from device to host to decide if Newton’s method should terminate. These different factors suggest alternative GPU implementations need to be investigated to address these different issues.

### Discussion

#### Work Environment

All results in this paper were based on code written in C#, and compiled using Microsoft Visual Studio Express with Release settings to run on a commodity class Intel Core i7-2360QM quad core processor. Open source CUDAfy.NET is used to run C# to CUDA translated code on a commodity class NVIDIA GeForce GT 525m having two streaming multiprocessors providing 96 CUDA cores.

#### Future Work

Numerically, additional work could be done on the adaptive explicit scheme. In preparing the scheme, a cubic spline-based approach was abandoned in favor of simpler approaches due to time pressures. It would be worthwhile to explore how to solve the system on a spline-based “grid”.

In addition, further work could be done to optimize the implementation of the parallel Cyclic Reduction algorithm on the GPU since it delivered disappointing runtime behavior compared to the sequential algorithm on the CPU. [CmWH14] mention several different optimizations to try, and I believe better global memory access will improve runtime at the expense of more complicated addressing. As an alternative to Cyclic Reduction, both [CmWH14] and [ZCO10] detail several different parallel tridiagonal solvers that could be deployed.

In terms of models considered, there are a number of different directions that could be pursued including higher dimensional [MC04], coupled [Cat14]], [Ril06], and stochastic variants [Tuc13] of the spatial FitzHugh-Nagumo equation. Coming from a probabilistic background, I would be interested in investing time in learning how to solve stochastic ordinary and partial differential equations.

#### Conclusion

Three finite difference schemes were evaluated. An explicit scheme shows great performance on both the CPU and GPU, but it is susceptible to numerical oscillations. To address this issue, an adaptive explicit scheme based on heuristics was devised and is able to eliminate these issues while requiring fewer nodes to produce results on-par with the explicit scheme. An implicit scheme was evaluated which demonstrated a principled, and robust solution for a variety of test cases and is the favored approach of the three evaluated.

### References

[AK15] Gianni Arioli and Hans Koch. Existence and stability of traveling pulse solutions of the fitzhugh-nagumo equation. Nonlinear Analysis: Theory, Methods and Applications, 113:51-70, 2015

[Cat14] Anna Cattani. Fitzhugh-nagumo equations with generalized diffusive coupling. Mathematical Biosciences and Engineering, 11(2):203-215, April 2014

[CmWH14] Li-Wen Chang and Wen mei W. Hwu. A guide for implementing tridiagonal solvers on gpus. In Numerical Computations with GPUs, pages 29-44. Springer International Publishing, 2014.

[Fit61] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445, 1961.

[HH52] Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500-544, 1952.

[Hoc65] R. W. Hockney. A fast direct solution of poisson’s equation using fourier analysis. J. ACM, 12(1):95-113, Jan 1965.

[Kee02] James P. Keener. Spatial modeling. In Computational Cell Biology,volume 20 of Interdisciplinary Applied Mathematics, pages 171-197. Springer New York, 2002.

[MC04] Maria Murillo and Xiao-Chuan Cai. A fully implicit parallel algorithm for simulating the non-linear electrical activity of the heart. Numerical linear algebra with applications, 11(2-3):261-277, 2004.

[NAY62] J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061-2070, Oct 1962.

[Ott10] Denny Otten. Mathematical models of reaction diffusion systems, their numerical solutions and the freezing method with comsol multiphysics. 2010.

[Ril06] Caroline Jane Riley. Reaction and diffusion on the sierpinkski gasket. PhD thesis, University of Manchester, 2006.

[Tuc13] Henry C. Tuckwell. Stochastic partial differential equations in neurobiology. Linear and nonlinear models for spiking neurons. In Stochastic Biomathematical Models, volume 2058 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2013.

[VBSS89] J. G. Verwer, J. G. Blom, and J. M. Sanz-Serna. An adaptive moving grid method for one dimensional systems of partial differential equations. Journal of Computational Physics, 82(2):454-486, 1989.

[ZCO10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the gpu. SIGPLAN Not., 45(5):127-136, Jan 2010.

[Zwa11] M. N. Zwarts. A test set for an adaptive moving grid pde solver with time-dependent adaptivity. Master’s thesis. Utrecht University, Utrecht, Netherlands, 2011.

Written by lewellen

2015-11-30 at 1:58 pm

## Expected Maximum and Minimum of Real-Valued Continuous Random Variables

### Introduction

This is a quick paper exploring the expected maximum and minimum of real-valued continuous random variables for a project that I’m working on. This paper will be somewhat more formal than some of my previous writings, but should be an easy read beginning with some required definitions, problem statement, general solution and specific results for a small handful of continuous probability distributions.

### Definitions

Definition (1) : Given the probability space, $(\Omega, \mathcal{F}, \mathbb{P})$, consisting of a set representing the sample space, $\Omega$, a $\text{Borel }\sigma \text{-algebra}$, $\mathcal{F}$, and a Lebesgue measure, $\mathbb{P}$, the following properties hold true:

1. Non-negativity: $\mathbb{P}(F) \ge 0 \quad \forall F \in \mathcal{F}$
2. Null empty set: $\mathbb{P}(\emptyset) = 0$
3. Countable additivity of disjoint sets $\displaystyle \mathbb{P}\left( \bigcup_{i=0}^{\infty} F_i \right) = \sum_{i=0}^{\infty} \mathbb{P}(F_i) \quad F_i \subset \mathcal{F}$

Definition (2) : Given a real-valued continuous random variable such that $X : \Omega \to \mathbb{R}$, the event the random variable takes on a fixed value, $x \in \mathbb{R}$, is the event $\lbrace \omega : X(\omega) = x \rbrace \in \mathcal{F}$ measured by the probability distribution function $f_X(x) = \mathbb{P}(X = x)$. Similarly, the event that the random variable takes on a range of values less than some fixed value, $x \in \mathbb{R}$, is the event $\lbrace \omega : X(\omega) \le x \rbrace \in \mathcal{F}$ measured by the cumulative distribution function $F_X(x) = \mathbb{P}(X \le x)$. By Definition, the following properties hold true:

1. $\displaystyle F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$
2. $\displaystyle \lim_{x \to \infty} F_X(x) = 1$
3. $\displaystyle 1 - \int_{-\infty}^{x} t f_X(t) \, dt = \int_{x}^{\infty} t f_X(t) \, dt$

Defintion (3) : Given a second real-valued continuous random variable, $Y : \Omega \to \mathbb{R}$, The joint event $\lbrace \omega : X(\omega) = x, Y(\omega) = y \rbrace \in \mathcal{F}$ $(x,y) \in \mathbb{R}^2$ will be measured by the joint probability distribution $f_{X, Y}(x,y) = \mathbb{P}(X = x, Y = y)$. If $X$ and $Y$ are statistically independent, then $f_{X,Y}(x,y) = f_X(x) f_Y(y)$.

Definition (4) : Given a real-valued continuous random variable, $X : \Omega \to \mathbb{R}$, the expected value is $\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx$.

Definition (5) : (Law of the unconscious statistician) Given a real-valued continuous random variable, $X$, and a function, $g : \mathbb{R} \to \mathbb{R}$, then $g(X)$ is also a real-valued continuous random variable and its expected value is $\displaystyle \mathbb{E}(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx$ provided the integral converges. Given two real-valued continuous random variables, $X, Y$, and a function, $g : \mathbb{R}^2 \to \mathbb{R}$, then $g(X, Y)$ is also a real-valued continuous random variable and its expected value is $\displaystyle \mathbb{E}(g(X,Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{X,Y}(x,y) \, dx \, dy$. Under the independence assumption of Definition (3), the expected value becomes $\displaystyle \mathbb{E}(g(X,Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_X(x) f_Y(y) \, dx \, dy$.

Remark (1) : For the remainder of this paper, all real-valued continuous random variables will be assumed to be independent.

### Problem Statement

Theorem (1) : Given two real-valued continuous random variables $X, Y \in \Omega \to \mathbb{R}$, then the expected value of the minimum of the two variables is $\mathbb{E} \left( \min{ \left( X, Y \right) } \right ) = \mathbb{E} \left( X \right ) + \mathbb{E} \left( Y \right ) - \mathbb{E} \left( \max{ \left( X, Y \right) } \right )$.

Lemma (1) : Given two real-valued continuous random variables $X, Y \in \Omega \to \mathbb{R}$, then the expected value of the maximum of the two variables is $\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$

Proof of Lemma (1) :

$\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max{\left( x, y \right)} f_X(x) f_Y(y) \, dx \, dy$ (Definition (5))

$\displaystyle = \int_{-\infty}^{\infty} \int_{-\infty}^{x} x f_X(x) f_Y(y) \, dy \, dx + \int_{-\infty}^{\infty} \int_{-\infty}^{y} y f_X(x) f_Y(y) \, dx \, dy$ (Definition (1.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left ( \int_{-\infty}^{x} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left ( \int_{-\infty}^{y} f_X(x) \, dx \right ) \, dy$ (Fubini’s theorem)

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy \quad \square$ (Definition (2.i))

Proof of Theorem (1)

$\displaystyle \mathbb{E} \left( \min{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \min{\left( x, y \right)} f_X(x) f_Y(y) \, dx \, dy$ (Definition (4))

$\displaystyle = \int_{-\infty}^{\infty} \int_{x}^{\infty} x f_X(x) f_Y(y) \, dy \, dx + \int_{-\infty}^{\infty} \int_{y}^{\infty} y f_X(x) f_Y(y) \, dx \, dy$ (Definition (1.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left ( \int_{x}^{\infty} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left ( \int_{y}^{\infty} f_X(x) \, dx \right ) \, dy$ (Fubini’s theorem)

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left (1 - \int_{-\infty}^{x} f_Y(y) \, dy \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left (1 - \int_{-\infty}^{y} f_X(x) \, dx \right ) \, dy$ (Definition (2.iii))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \left (1 - F_Y(x) \right ) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \left( 1 - F_X(y) \right ) \, dy$ (Definition (2.i))

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \, dx - \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \, dy - \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$

$\displaystyle = \int_{-\infty}^{\infty} x f_X(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) \, dy - \left ( \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy \right )$

$\displaystyle = \mathbb{E}(X) + \mathbb{E}(Y) - \mathbb{E} \left( \max{ \left( X, Y \right) } \right) \quad \blacksquare$ (Definition (4), Lemma (1))

Remark (2) : For real values $x, y \in \mathbb{R}$, $\min{\left(x,y\right)} = x + y - \max{ \left( x, y \right) }$.

Proof Remark (2) : If $x \ge y$, then $\min{\left(x,y\right)} = y$, otherwise $x$. If $x \ge y$, then $\max{\left(x,y\right)} = x$, otherwise $y$. If $x \ge y$, then $\min{\left(x,y\right)} = y + \left( x - \max{\left(x,y\right)} \right )$, otherwise, $\min{\left(x,y\right)} = x + \left( y - \max{\left(x,y\right)} \right )$. Therefore, $\min{\left(x,y\right)} = x + y - \max{\left(x,y\right)} \quad \square$

### Worked Continuous Probability Distributions

The following section of this paper derives the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution and continuous uniform distribution. The derivation of the expected value of the minimum of real-valued continuous random variables is omitted as it can be found by applying Theorem (1).

#### Exponential Distribution

Definition (6) : Given a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, with rate parameter, $\alpha > 0$, the probability density function is $\displaystyle f_X(x) = \alpha e^{-\alpha x}$ for all $x \ge 0$ and zero everywhere else.

Corollary (6.i) The cumulative distribution function of a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, is therefore $\displaystyle F_X(x) = 1 - e^{-\alpha x}$ for all $x \ge 0$ and zero everywhere else.

Proof of Corollary (6.i)

$\displaystyle F_X(x) = \int_{-\infty}^{x} f_x(t) \, dt = \int_{0}^{x} \alpha e^{-\alpha t} \, dt = -\frac{\alpha}{\alpha} e^{-\alpha t} \bigg|_{0}^{x} = 1 - e^{- \alpha x} \quad \square$

Corollary (6.ii) : The expected value of a real-valued continuous exponentially distributed random variable, $X \sim \text{Exp}(\alpha)$, is therefore $\displaystyle \frac{1}{\alpha}$.

Proof of Corollary (6.ii)

The expected value is $\mathbb{E}(X) = \frac{1}{\alpha}$ by Definition (4) and Lemma (2) $\square$.

Lemma (2) : Given real values $\alpha, \gamma \in \mathbb{R} \quad \gamma \neq 0$, then $\displaystyle \int_{0}^{\infty} \alpha x e^{-\gamma x} \, dx = \frac{\alpha}{\gamma^2}$.

Proof of Lemma (2) :

$\displaystyle \int_{0}^{\infty} \alpha x e^{-\gamma x} \, dx = - x \frac{\alpha}{\gamma} e^{-\alpha x} \bigg|_{0}^{\infty} + \int_{0}^{\infty} \frac{\alpha}{\gamma} e^{-\gamma x} \, dx = - x \frac{\alpha}{\gamma} e^{-\alpha x} - \frac{\alpha}{\gamma}^2 e^{-\alpha x} \bigg|_{0}^{\infty}$

$\displaystyle = \lim_{x \to \infty} \left( - x \frac{\alpha}{\gamma} e^{-\alpha x} - \frac{\alpha}{\gamma^2} e^{-\alpha x} \right ) - \left( - \frac{\alpha}{\gamma^2} \right) = \frac{\alpha}{\gamma^2} \quad \square$

Theorem (2) : The expected value of the maximum of the real-valued continuous exponentially distributed random variables $X \sim \text{Exp}(\alpha)$, $Y \sim \text{Exp}(\beta)$ is $\displaystyle \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha + \beta}$.

Proof of Theorem (2) :

$\displaystyle \mathbb{E} \left ( \max{\left( X, Y \right)} \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{0}^{\infty} x \alpha e^{-\alpha x} \left( 1 - e^{-\beta x} \right ) \, dx + \int_{0}^{\infty} y \beta e^{-\beta y} \left( 1 - e^{-\alpha y} \right ) \, dy$ (Corollary (6.i))

$\displaystyle = \left( \int_{0}^{\infty} x \alpha e^{-\alpha x} \, dx \right )- \left( \int_{0}^{\infty} x \alpha e^{-(\alpha + \beta) x} \, dx \right ) + \left( \int_{0}^{\infty} y \beta e^{-\beta y} \, dy \right ) - \left( \int_{0}^{\infty} y \beta e^{-(\alpha + \beta) y} \, dy \right)$ (Integral linearity)

$\displaystyle = \frac{1}{\alpha} - \frac{\alpha}{(\alpha + \beta)^2} + \frac{1}{\beta} - \frac{\beta}{(\alpha + \beta)^2}$ (Lemma (2), Corollary (6.ii))

$\displaystyle = \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha+\beta} \quad \blacksquare$

#### Normal Distribution

Definition (7) : The following Gaussian integral is the error function $\displaystyle \text{erf}(x) = \frac{2}{ \sqrt{\pi} } \int_{0}^{x} e^{ - u^2 } \, du$ for which the following properties hold true:

1. Odd function: $\displaystyle \text{erf}(-x) = -\text{erf}(x)$
2. Limiting behavior: $\displaystyle \lim_{x \to \infty} \text{erf}(x) = 1$

Definition (8) : Given a real-valued continuous normally distributed random variable , $X \sim \mathcal{N}(\mu, \sigma)$, with mean parameter, $\mu$. and standard deviation parameter, $\sigma \neq 0$, the probability density function is $\displaystyle f_X(x) = \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{1}{2} \left ( \frac{x - \mu}{\sigma} \right )^2 }$ for all values on the real line.

Corollary (8.i) : The cumulative distribution function of a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(\mu, \sigma)$, is therefore $\displaystyle F_X(x) = \frac{1}{2} \left (1 + \text{erf} \left ( \frac{x-\mu}{\sqrt{2} \sigma} \right ) \right )$.

Proof of Corollary (8.i) :

$\displaystyle F_X(x) = \int^{x}_{-\infty} \frac{1}{\sigma \sqrt{2 \pi} } e^{ - \left ( \frac{t - \mu}{\sqrt{2} \sigma} \right )^2 } \, dt$ (Definition (2.i))

$\displaystyle = \frac{1}{ \sqrt{\pi} } \int_{-\infty}^{\frac{x - \mu}{\sqrt{2} \sigma}} e^{ - u^2 } \, du$ (U-substitution with $\displaystyle u = \frac{t - \mu}{\sqrt{2} \sigma} \implies du = \frac{1}{\sqrt{2} \sigma} dt$)

$\displaystyle = \frac{1}{ \sqrt{\pi} } \int_{-\infty}^{ 0 } e^{ - u^2 } \, du + \frac{1}{ \sqrt{\pi} } \int_{0}^{ \frac{x - \mu}{\sqrt{2} \sigma} } e^{ - u^2 } \, du$ (Definition (2.iii))

$\displaystyle = - \frac{1}{ \sqrt{\pi} } \int_{0}^{-\infty} e^{ - u^2 } \, du + \frac{1}{ \sqrt{\pi} } \int_{0}^{ \frac{x - \mu}{\sqrt{2} \sigma} } e^{ - u^2 } \, du$ (Reverse limits of integration)

$\displaystyle = \frac{1}{2} \lim_{u \to \infty} - \text{erf}(-u) + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7))

$\displaystyle = \frac{1}{2} \lim_{u \to \infty} \text{erf}(u) + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7.i))

$\displaystyle = \frac{1}{2} + \frac{1}{2} \text{erf} \left ( \frac{x - \mu}{\sqrt{2} \sigma} \right )$ (Definition (7.ii))

$\displaystyle = \frac{1}{2} \left (1 + \text{erf} \left ( \frac{x-\mu}{\sqrt{2} \sigma} \right ) \right ) \quad \square$

Corollary (8.ii) : The expected value of a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(\mu, \sigma)$, is therefore $\mathbb{E}(X) = \mu$.

Proof of Corollary (8.ii) :

$\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx = \int_{-\infty}^{\infty} x \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{1}{2} \left ( \frac{x - \mu}{\sigma} \right )^2 } \, dx$ (Definition (4))

$\displaystyle = \int_{-\infty}^{\infty} (\sqrt{2}\sigma u + \mu) \frac{1}{\sqrt{\pi} } e^{ - u^2 } \, du$ (U-substitution with $\displaystyle u = \frac{x-\mu}{\sqrt{2} \sigma} \implies du = \frac{1}{\sqrt{2}\sigma} dx$ $\sqrt{2}\sigma u + \mu = x$)

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \int_{-\infty}^{\infty} u e^{ - u^2 } \, du + \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{ - u^2 } \, du$ (Integral linearity)

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \left( \int_{-\infty}^{0} u e^{ - u^2 } \, du + \int_{0}^{\infty} u e^{ - u^2 } \, du \right ) + \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{ - u^2 } \, du$ (Definition (1.iii))

$\displaystyle = \frac{\sqrt{2}\sigma}{\sqrt{\pi}} \left( - \int_{0}^{\infty} u e^{ - u^2 } \, du + \int_{0}^{\infty} u e^{ - u^2 } \, du \right ) + 2 \frac{\mu}{\sqrt{\pi}} \int_{0}^{\infty} e^{ - u^2 } \, du$ ($u e^{-u^2}$ is odd, $e^{-u^2}$ is even)

$\displaystyle = \mu \frac{2}{\sqrt{\pi}} \left ( \frac{\sqrt{\pi}}{2} \lim_{x \to \infty} \text{erf}(x) \right ) = \mu \quad \square$ (Definition (7), Definition (7.ii))

Definition (9) : Given a real-valued continuous normally distributed random variable, $X \sim \mathcal{N}(0, 1)$, the probability distribution function will be denoted as standard normal probability distribution function, $\phi(x)$, and the cumulative distribution function as the standard normal cumulative distribution function, $\Phi(x)$. By definition, the following properties hold true:

1. Non-standard probability density function: If $X \sim \mathcal{N}(\mu, \sigma)$, then $\displaystyle f_X(x) = \frac{1}{\sigma} \phi \left( \frac{x - \mu}{\sigma} \right )$
2. Non-standard cumulative distribution function: If $X \sim \mathcal{N}(\mu, \sigma)$, then $\displaystyle F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right )$
3. Complement: $\Phi(-x) = 1 - \Phi(x)$

Definition (10) : [PaRe96] Given $\phi(x)$ and $\Phi(x)$, the following integrals hold true:

1. $\displaystyle \int_{-\infty}^\infty x\Phi(a+bx)\phi(x) \, dx = \frac{b}{\sqrt{1+b^2}} \phi \left( \frac{a}{\sqrt{1+b^2}} \right )$
2. $\displaystyle \int_{-\infty}^\infty \Phi(a+bx)\phi(x) \, dx = \Phi \left ( \frac{a}{\sqrt{1+b^2}} \right )$

Theorem (3) : The expected value of the maximum of the real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu, \sigma)$, $Y \sim \mathcal{N}(\nu, \tau)$ is $\displaystyle \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu$.

Lemma (3) : Given real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu, \sigma)$, $Y \sim \mathcal{N}(\nu, \tau)$, $\displaystyle \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy = \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$.

Proof of Lemma (3) :

$\displaystyle \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy = \int_{-\infty}^{\infty} y \frac{1}{\tau} \phi \left ( \frac{y-\nu}{\tau} \right ) \Phi \left ( \frac{y-\mu}{\sigma} \right ) \, dy$ (Definition (9.i), Definition (9.ii))

$\displaystyle = \int_{-\infty}^{\infty} (u \tau + \nu) \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du$ (U-substitution with $\displaystyle u = \frac{y-\nu}{\tau} \implies du = \frac{1}{\tau} dy$, $y = u \tau + \nu$)

$\displaystyle = \tau \int_{-\infty}^{\infty} u \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du + \nu \int_{-\infty}^{\infty} \phi(u) \Phi \left ( \frac{u \tau + \nu -\mu}{\sigma} \right ) \, du$ (Integral linearity)

$\displaystyle = \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) \, \square$ (Definition (10.i), Definition (10.ii))

Proof of Theorem (3) :

$\displaystyle \mathbb{E} \left( \max{ \left ( X, Y \right) } \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{-\infty}^{\infty} x \frac{1}{\sigma} \phi \left ( \frac{x-\mu}{\sigma} \right ) \Phi \left ( \frac{x-\nu}{\tau} \right ) \, dy + \int_{-\infty}^{\infty} y \frac{1}{\tau} \phi \left ( \frac{y-\nu}{\tau} \right ) \Phi \left ( \frac{y-\mu}{\sigma} \right ) \, dy$ (Definition (11.i), Definition (11.ii))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\mu - \nu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \mu \Phi \left ( \frac{\mu - \nu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$ (Lemma (3))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \mu \left ( 1 - \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) \right ) + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$ (Definition (9.iii))

$\displaystyle = \frac{\sigma^2}{\sqrt{\sigma^2 + \tau^2}} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) - \mu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu + \frac{\tau^2}{\sqrt{\sigma^2 + \tau^2 }} \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + \nu \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right )$

$\displaystyle = \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu \quad \blacksquare$

#### Continuous Uniform Distribution

Definition (11) : Given a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, with inclusive boundaries $a, b$ such that $a < b$, the probability density function is $\displaystyle f_X(x) = \frac{1}{b-a}$ for all $x \in [a, b]$ and zero everywhere else.

Corollary (11.i) : The cumulative distribution function of a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, is therefore $\displaystyle F_X(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & x \in [a,b] \\ 1 & x > b \end{cases}$.

Proof of Corollary (11.i) :

$\displaystyle F_X(x) = \int_{-\infty}^{\infty} f_X(t) \, dt = \int_{a}^{b} \frac{1}{b-a} \, dt = \frac{1}{b-a} x \bigg|_{a}^{x} = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & x \in [a,b] \\ 1 & x > b \end{cases}\quad \square$.

Corollary (11.ii) : The expected value of a real-valued continuous uniformly distributed random variable, $X \sim U(a,b)$, is therefore $\displaystyle \frac{a+b}{2}$.

Proof of Corollary (11.ii)

$\displaystyle \mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x) \, dx = \int_{a}^{b} x \frac{1}{b-a} \, dx = \frac{x^2}{2(b-a)} \bigg|_{a}^{b} = \frac{ b^2 -a^2 }{ 2(b-a) } = \frac{b+a}{2} \quad \square$

Theorem (4) : The expected value of the maximum of real-valued continuous uniformly distributed random variables $X \sim U(a,b)$, $Y \sim U(c,d)$ is $\displaystyle \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases}$.

Proof of Theorem (4) :

$\displaystyle \mathbb{E} \left ( \max{ \left ( X, Y \right )} \right ) = \int_{-\infty}^{\infty} x f_X(x) F_Y(x) \, dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y) \, dy$ (Lemma (1))

$\displaystyle = \int_{a}^{b} x \frac{1}{b-a} \begin{cases} 0 & x < c \\ \frac{x - c}{d-c} & x \in [c,d] \\ 1 & \text{otherwise} \end{cases} \, dx + \int_{c}^{d} y \frac{1}{d-c} \begin{cases} 0 & y < a \\ \frac{y - a}{b-a} & y \in [a,b] \\ 1 & \text{otherwise} \end{cases} \, dy$

Case (1) : $a < b \le c < d$

$\displaystyle = \left ( \int_{a}^{b} x \frac{1}{b-a} 0 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{c+d}{2} \quad \square$

Case (2) : $a \le c < b \le d$

$\displaystyle = \left ( \int_{a}^{c} x \frac{1}{b-a} 0 \, dx + \int_{c}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx \right ) + \left ( \int_{c}^{b} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy + \int_{b}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{2 x^3 - 3 c x^2}{6(b-a)(d-c)} \bigg|_{c}^{b} + \frac{2 y^3 - 3ay^2 }{6(d-c)(b-a)} \bigg|_{c}^{b} + \frac{y^2}{2(d-c)} \bigg|_{b}^{d}$

$\displaystyle = \frac{2 (b^3 - c^3) - 3 c (b^2 - c^2) }{6(b-a)(d-c)} + \frac{2 (b^3 - c^3) - 3 a (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)}$

$\displaystyle = \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} \quad \square$

Case (3) : $a \le c < d \le b$

$\displaystyle = \left ( \int_{a}^{c} x \frac{1}{b-a} 0 \, dx + \int_{c}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx + \int_{d}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy \right)$

$\displaystyle = \frac{2x^3 - 3cx^2}{6(b-a)(d-c)} \bigg|_{c}^{b} + \frac{x^2}{2(b-a)} \bigg|_{d}^{b} + \frac{2y^3 - 3ay^2}{6(b-a)(d-c)} \bigg|_{c}^{d}$

$\displaystyle = \frac{2(b^3-c^3) - 3c(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} + \frac{2(d^3-c^3) - 3a(d^2-c^2)}{6(b-a)(d-c)}$

$\displaystyle = \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} \quad \square$

Case (4) : $c \le a < b \le d$

$\displaystyle = \left( \int_{a}^{b} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx \right ) + \left( \int_{c}^{a} y \frac{1}{d-c} 0 \, dy + \int_{a}^{b} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy + \int_{b}^{d} y \frac{1}{d-c} 1 \, dy \right )$

$\displaystyle = \frac{2 x^3 - 3 c x^2 }{6(d-c)(b-a)} \bigg|_{a}^{b} + \frac{2 y^3 - 3 a y^2 }{6(d-c)(b-a)} \bigg|_{a}^{b} + \frac{y^2}{2(d-c)} \bigg|_{b}^{d}$

$\displaystyle = \frac{2 (b^3-a^3) - 3 c (b^2 - a^2) }{6(d-c)(b-a)} + \frac{2 (b^3-a^3) - 3 a (b^2 -a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)}$

$\displaystyle = \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} \quad \square$

Case (5) : $c \le a < d \le b$

$\displaystyle = \left ( \int_{a}^{d} x \frac{1}{b-a} \frac{x-c}{d-c} \, dx + \int_{d}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{a} y \frac{1}{d-c} 0 \, dy + \int_{a}^{d} y \frac{1}{d-c} \frac{y-a}{b-a} \, dy \right )$

$\displaystyle = \frac{2 x^3 -3 c x^2}{6(b-a)(d-c)} \bigg|_{a}^{d} + \frac{x^2}{2(b-a)} \bigg|_{d}^{b} + \frac{2 y^3 -3 a y^2}{6(b-a)(d-c)} \bigg|_{a}^{d}$

$\displaystyle = \frac{2 (d^3 - a^3) -3 c (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} + \frac{2 (d^3-a^3) -3 a (d^2-a^2)}{6(b-a)(d-c)}$

$\displaystyle = \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} \quad \square$

Case (6) : $c < d \le a < b$

$\displaystyle = \left ( \int_{a}^{b} x \frac{1}{b-a} 1 \, dx \right ) + \left ( \int_{c}^{d} y \frac{1}{d-c} 0 \, dy \right )$

$\displaystyle = \frac{a+b}{2}$

$\displaystyle \therefore \mathbb{E} \left ( \max{\left ( X, Y \right )} \right ) = \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases} \quad \blacksquare$

### Summary Table

The following summary table lists the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution and continuous uniform distribution. The corresponding minimum can be obtained by Theorem (1).

Random Variables Maximum
$X \sim$ $Y \sim$
$\text{Exp}(\alpha)$ $\text{Exp}(\beta)$ $\displaystyle \frac{1}{\alpha} + \frac{1}{\beta} - \frac{1}{\alpha + \beta}$
$\mathcal{N}(\mu, \sigma)$ $\mathcal{N}(\nu, \tau)$ $\displaystyle \sqrt{ \sigma^2 + \tau^2 } \phi \left( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2 }} \right ) + (\nu - \mu) \Phi \left ( \frac{\nu - \mu}{\sqrt{\sigma^2 + \tau^2}} \right ) + \mu$
$\text{U}(a, b)$ $\text{U}(c, d)$ $\displaystyle \begin{cases} \frac{c+d}{2} & a < b \le c < d \\ \frac{4 (b^3 - c^3) - 3 (c + a) (b^2 - c^2) }{6(d-c)(b-a)} + \frac{d^2 - b^2}{2(d-c)} & a \le c < b \le d \\ \frac{4(b^3-c^3) - 3(c+a)(b^2-c^2)}{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & a \le c < d \le b \\ \frac{4 (b^3-a^3) - 3 (c + a) (b^2 - a^2) }{6(d-c)(b-a)} + \frac{d^2-b^2}{2(d-c)} & c \le a < b \le d\\ \frac{4 (d^3 - a^3) -3 (c+a) (d^2-a^2) }{6(b-a)(d-c)} + \frac{b^2-d^2}{2(b-a)} & c \le a < d \le b \\ \frac{a+b}{2} & c < d \le a < b \end{cases}$

### References

[GrSt01] Grimmett, Geoffrey, and David Stirzaker. Probability and Random Processes. Oxford: Oxford UP, 2001. Print.

[PaRe96] Patel, Jagdish K., and Campbell B. Read. Handbook of the Normal Distribution. 2nd ed. New York: Marcel Dekker, 1996. Print.

Written by lewellen

2013-01-01 at 8:00 am

## Tropical Representation of the All-Pairs Shortest Path Problem

### Motivation

While I was doing my Abstract Algebra research the other month, I came across an interesting way of simplifying the representation of the all-pairs shortest path problem using Tropical Geometry. I thought it was pretty clever, so I thought I’d do a quick write-up.

### Problem Statement

The all-pairs shortest path problem is to identify the minimum path cost, $\Omega(\pi) = \sum_{e \in \pi} \omega(e)$, out of the possible paths $\pi_{i,j} \in \Pi_{i,j}$ between vertices $v_{i}$ and $v_{j}$.

### Proposition

Consider a weighted directed graph (digraph), $G = (V, E, \omega)$, consisting of vertices, $V$, and directed edges (arcs), $E \subseteq V \times V$, and a function, $\omega : E \to \overline{\mathbb{R}}_{+}$, yielding the weight of an edge. Only those weights from the positive affinely extended real numbers, $\overline{\mathbb{R}}_{+} = \mathbb{R}_{+} \cup \lbrace \infty \rbrace$, are allowed per the problem statement. The adjacency matrix representation, $D \in \overline{\mathbb{R}}_{+}^{\lvert V \rvert \times \lvert V \rvert}$, of $G$ is given by the following matrix:

$D_{i, j} = \begin{cases} 0 & i = j \\ \omega((v_{i}, v_{j})) & (v_{i}, v_{j}) \in E \\ \infty & \text{otherwise} \end{cases}$

Now, consider a semi-ring over $x, y \in \overline{\mathbb{R}}_{+}$ whose additive operator, $\oplus \in \overline{\mathbb{R}}_{+} \to \overline{\mathbb{R}}_{+}$, is given by the minimum function, $x \oplus y = \min(x,y)$, and whose multiplicative operator, $\odot \in \overline{\mathbb{R}}_{+} \to \overline{\mathbb{R}}_{+}$, is given by addition, $x \odot y = x + y$. The additive unit is given by infinity, $x \oplus \infty = x$, and the multiplicative unit by zero, $x \odot 0 = x$. This semi-ring is the Tropical Semi-ring $\mathbb{T} = \left ( \overline{\mathbb{R}}_{+}, \oplus, \odot \right )$. (The namesake of tropical is in honor of Brazilian Imre Simon who developed this branch of mathematics.)

Linear algebra constructs can be tropicalized to yield the familiar definitions for matrix addition and multiplication for matricies $A, B \in \overline{\mathbb{R}}_{+}^{n \times m}$ and $C \in \overline{\mathbb{R}}_{+}^{m \times n}$.

$\displaystyle \left (A \oplus B \right )_{i, j} = A_{i,j} \oplus B_{i, j}$

$\displaystyle (A \odot C)_{i,j} = \bigoplus_{k}^{m} A_{i, k} \odot C_{k, j}$

Given the two prior statements, the elegant solution to the all-pairs shortest path problem is given by taking powers of the adjacency matrix: $D_{G}^{\odot \lvert V \rvert - 1}$.

### Proof

To see how this works out, start with $D_{G}$. The matrix represents the minimum cost between any two adjacent vertices. In other words, the minimum cost for all paths containing a single edge. The next inductive step is to consider paths containing at most two adjacent edges. Squaring the adjacency matrix yields all such paths. When the matrix is squared, each edge is concatenated to all other adjacent edges and the minimum weight of the paths is selected. This thought process can iterated as follows:

$\displaystyle D_{G}^{\odot r} = D_{G}^{\odot r - 1} \odot D_{G}$
$\displaystyle D_{i, j}^{\odot r} = \bigoplus_{k}^{m} D_{i, k}^{\odot r - 1} \odot D_{k, j}$
$\displaystyle D_{i, j}^{\odot r} = \min { \lbrace D_{i, k}^{\odot r - 1} + D_{k, j} \rbrace }$ $\forall k \in [1, m]$

The result is a typical Bellman equation. A graph can have at most $\lvert V \rvert - 1$ edges between any two vertices, thus, the solution to the all-pairs shortest path problem is given by $\displaystyle D_{G}^{\odot \lvert V \rvert - 1}$.

### Example

As a worked example, consider the following graph whose set of vertices is given by the set $V = \lbrace a, b, c, d \rbrace$, set of arcs by $E = \lbrace (a,b), (a,c), (a,d), (b,c), (b, d), (c,d) \rbrace$ and weight function, $\omega$, as labeled on the graph.

The all-pairs shortest paths are given by the following calculations where the row and column coordinates correspond to the vertices of $V$. Values in bold denote a change in the shortest path between two vertices.

$D_{G} = \begin{pmatrix}0 & 1 & 8 & 12\\\infty & 0 & 2 & 10\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$ $D_{G}^{\odot 2} = \begin{pmatrix}0 & 1 & \textbf{3} & \textbf{11}\\\infty & 0 & 2 & \textbf{5}\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$ $D_{G}^{\odot 3} = \begin{pmatrix}0 & 1 & 3 & \textbf{6}\\\infty & 0 & 2 & 5\\\infty & \infty & 0 & 3 \\ \infty & \infty & \infty & 0 \end{pmatrix}$

### Computational Complexity

From asymptotic standpoint, tropical matrix multiplication is still on the order of traditional matrix multiplication of $\mathcal{O}(\lvert V \rvert^{3} )$. Computing the all-pairs shortest path problem using this approach is on the order of $\mathcal{O}(\lvert V \rvert^{4})$ since we must perform the tropical matrix multiplication $\lvert V \rvert - 1$ times. Now, This can be improved slightly since tropical matrix multiplication is associative, so we can leverage the repeated squaring approach and reduce the time complexity down to $\mathcal{O}(\lvert V \rvert^{3} \log{\lvert V \rvert})$.

The time complexity can be further reduced to $\mathcal{O}(\lvert V \rvert^{3} )$ using the Floyd-Warshall Algorithm, which is another dynamic programming approach that is similar in form to the tropical representation of the problem. In essence, it follows the same base case, but it’s recurrence statement only considers a range of vertices with respect to the two vertices being considered. A more in depth review of the algorithm can be found in the references.

### References

Floyd-Warshall’s Algorithm.” Algorithmist. Web. 12 Apr. 2012.

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. “25.2 The Floyd-Warshall Algorithm.” Introduction to Algorithms. 2nd ed. Cambridge, MA: MIT, 2001. 629-35. Print.

Diestel, Reinhard. Graph theory. Heidelberg New York: Springer, 2010.

Laface, Antonio. Introduction to Tropical Geometry [pdf]. 29 Nov. 2006. Web. 11 Apr. 2012.

Maclagan, Diane, and Bernd Sturmfels. Introduction to Tropical Geometry [pdf]. 4 Nov. 2009. Web. 9 Apr. 2012.

Mohri, Mehryar. “Semiring Frameworks and Algorithms for Shortest-Distance Problems” [pdf]. Journal of Automata, Languages and Combinatorics 7 (2002) 3: 321-50. 8 Aug. 2002. Web. 31 Mar. 2012.

Written by lewellen

2012-06-01 at 8:00 am

## Minesweeper Agent

with one comment

### Introduction

Lately I’ve been brushing up on probability, statistics and machine learning and thought I’d play around with writing a Minesweeper agent based solely on these fields. The following is an overview of the game’s mechanics, verification of an implementation, some different approaches to writing the agent and some thoughts on the efficacy of each approach.

### Minesweeper

#### Background

Minesweeper was created by Curt Johnson in the late eighties and later ported to Windows by Robert Donner while at Microsoft. With the release of Windows 3.1 in 1992, the game became a staple of the operating system and has since found its way onto multiple platforms and spawned several variants. The game has been shown to be NP-Complete, but in practice, algorithms can be developed to solve a board in a reasonable amount of time for the most common board sizes.

#### Specification

##### Gameplay
An agent, $\mathcal{A}$, is presented a $n \times m$ grid containing $M$ uniformly distributed mines. The agent’s objective is to expose all the empty grid locations and none of the mines. Information about the mines’ grid locations is gained by exposing empty grid locations which will indicate how many mines exist within a unit (Chebyshev) distance of the grid location. If the exposed grid location is a mine, then the player loses the game. Otherwise, once all empty locations are exposed, the player wins. $\textbf{PLAY-GAME}(\mathcal{A}, n, m, M) \newline \indent H \leftarrow \textbf{INITIALIZE-HIDDEN}(n,m,M) \newline \indent V \leftarrow \textbf{INITIALIZE-VISIBLE}(n,m,M) \newline \newline \indent \textbf{do} \newline \indent \indent (i,j) \leftarrow \mathcal{A}(V) \newline \indent \indent \textbf{EXPOSE}(H, V, i, j) \newline \indent \indent \textbf{if} \quad V_{i,j} = \text{*} \newline \indent \indent \indent \textbf{return} \quad \textit{Failure} \newline \indent \textbf{while} \quad M \neq \lvert \textbf{U-LOCATIONS}(V) \rvert \newline \newline \indent \textbf{return} \quad \textit{Success}$
##### Initialization
The board consists of hidden and visible states. To represent the hidden, $H$, and visible state, $V$, of the board, two character matrices of dimension $n \times m$ are used.

Characters ‘0’-‘8’ represent the number of neighboring mines, character ‘U’ to represent an unexposed grid location and character ‘*’ for a mine.

Neighbors of a grid location $(i,j)$ is the set of grid locations such that $1 \le \lVert (u,v) - (i,j) \rVert_{\infty} \le r$. By default, $r = 1$.

$\textbf{INITIALIZE-VISIBLE}(n, m) \newline \indent \textbf{return} \quad \textbf{INITIALIZE}(n,m, \text{U})$

$\textbf{INITIALIZE-HIDDEN}(n, m, M) \newline \indent H \leftarrow \textbf{INITIALIZE}(n,m,\text{0}) \newline \newline \indent S \leftarrow \emptyset \newline \indent \textbf{while} \quad \lvert S \lvert < M \newline \indent \indent S \leftarrow S \cup \textbf{RANDOM}(\textbf{LOCATIONS}(H)) \newline \newline \indent \textbf{foreach} \quad (i, j) \in S \newline \indent \indent H_{i, j} = \text{*} \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(n, m, i, j) \setminus S \newline \indent \indent \indent H_{u,v} \leftarrow H_{u,v} + 1 \newline \newline \indent \textbf{return} \quad H$

##### Exposing Cells
The expose behavior can be thought of as a flood fill on the grid, exposing any empty region bordered by grid locations containing mine counts and the boundaries of the grid.

A matrix, $F \in \mathbb{Z}^{n \times m}$, represents the topography of the board. A value of zero is reserved for sections of the board that have yet to be visited, a value of one for those that have, two for those that are boundaries and three for mines. A stack, $S$, keeps track of locations that should be inspected.

If a cell location can be exposed, then each of its neighbors will be added to the stack to be inspected. Those neighbors that have already been inspected will be skipped. Once all the reachable grid locations have been inspected, the process terminates.

$\textbf{EXPOSE}(H, V, i, j) \newline \indent \textbf{foreach} \quad (u,v) \in \textbf{LOCATIONS}(H) \newline \indent \indent \textbf{if} \quad H_{u,v} = \text{0} \newline \indent \indent \indent F_{u,v} \leftarrow 0 \newline \indent \indent \textbf{if} \quad H_{i,j} = \text{*} \newline \indent \indent \indent F_{u,v} \leftarrow 3 \newline \indent \indent \textbf{else} \newline \indent \indent \indent F_{u,v} \leftarrow 2 \newline \newline \indent \textbf{PUSH}(S, (i, j)) \newline \indent \textbf{do} \newline \indent \indent (u, v) \leftarrow \textbf{POP}(S) \newline \indent \indent \textbf{if} \quad F_{u,v} = 0 \newline \indent \indent \indent F_{u,v} \leftarrow 1 \newline \indent \indent \indent \textbf{foreach} \quad (r,s) \in \textbf{NEIGHBORS}(H, u, v) \newline \indent \indent \indent \indent \textbf{PUSH}(S, (r,s)) \newline \indent \indent \textbf{elseif} \quad F_{u,v} \in (1, 2) \newline \indent \indent \indent V_{u,v} \leftarrow H_{u, v} \newline \indent \textbf{while} \quad \textbf{COUNT}(S) > 0$

#### Verification

##### Methodology

Statistical tests are used to verify the random aspects of the game’s implementation. I will skip the verification of the game’s logic as it requires use of a number of different methods that are better suited for their own post.

There are two random aspects worth thinking about: the distribution of mines and the distribution of success (i.e., not clicking a mine) for random trials. In both scenarios it made since to conduct Pearson’s chi-squared test. Under this approach there are two hypotheses:

• $H_{0}$: The distribution of experimental data follows the theoretical distribution
• $H_{a}$: The distribution experimental data does not follow the theoretical distribution

$H_{0}$ is accepted when the test statistic, $\chi^2$, is less than the critical value, $\chi_{\alpha}^2$. The critical value is determined by deciding on a p-value (e.g., 0.05, 0.01, 0.001), $\alpha$, that results in the tail area beneath the chi-squared distribution $\chi_{k}^2(x)$ equal to $\alpha$. $k$ is the degrees of freedom in the observation.

##### Mine distribution

The first aspect to verify was that mines were being uniformly placed on the board. For a standard $9 \times 9$ board with $10$ mines, the expectation is that each grid location should be assigned $\frac{10}{81} N$ times for $N$ trials. $k = 80$ for this experiment.

In the above experiment, $\chi^2 = 71.78$ and $\chi_{0.05}^2 = 101.87$. Since $\chi^2 < \chi_{0.05}^2$, this affirms $H_{0}$ and that the implemented distribution of mines is indeed uniform with a statistical significance of $0.05$.

##### Distribution of successful clicks

The second aspect to verify is that the number of random clicks before exposing a mine follows a hypergeometric distribution. The hypergeometric distribution is appropriate since we are sampling (exposing) without replacement (the grid location remains exposed after clicking). This hypothesis relies on a non-flood-fill exposure.

The distribution has four parameters. The first is the number of samples drawn (number of exposures), the second the number of successes in the sample (number of empty exposures), the third the number of successes in the population (empty grid locations) and the last the size of the population (grid locations): $h(nm-M;nm-M,nm-M, nm)$.

The expected frequencies for the hypergeometric distribution is given by $N h(nm - M; nm - M, nm - M, nm)$ for $N$ trials. $k = 70$ in this case.

In the above experiment $\chi^2 = 47.24$ and $\chi_{0.05}^2 = 90.53$. Since $\chi^2 < \chi_{0.05}^2$, this affirms $H_{0}$ and that the number of locations exposed prior to exposing a mine follows a hypergeometric distribution with a statistical significance of $0.05$.

Also included in the plot is the observed distribution for a flood based exposure. As one might expect, the observed frequency of more exposures decreases more rapidly than that of the non-flood based exposure.

### Agents

#### Methodology

Much like how a human player would learn to play the game, I decided that each model would have knowledge of game’s mechanics and no prior experience with the game. An alternative class of agents would have prior experience with the game as the case would be in a human player who had studied other player’s strategies.

To evaluate the effectiveness of the models, each played against a series of randomly generated grids and their respective success rates were captured. Each game was played on a standard beginner’s $9 \times 9$ grid containing between $[1, 10]$ mines.

For those models that refer to a probability measure, $\mathbb{P}$, it is assumed that the measure is determined empirically and treated as an estimate of the probability of an event and not as an a priori measure.

#### Marginal Model

##### Development

The first model to consider is the Marginal Model. It is designed to simulate the behavior of a naive player who believes that if he observes a mine at a grid location that the location should be avoid in future trials.

The model treats the visible board, $V$, as a matrix of discrete random variables where each grid location is interpreted as either $\textit{Empty}$ or (a) $\textit{Mine}$. This model picks the grid location with the greatest empirical probability of being empty:

$\underset{(i,j)}{\arg\max} \quad \mathbb{P}(X_{i,j} = \textit{Empty})$

$\textbf{MARGINAL-MODEL}(V) \newline \indent p_{\textit{max}} \leftarrow \perp \newline \indent (i, j)_{\textit{max}} \leftarrow \perp \newline \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{U-LOCATIONS}(V) \newline \indent \indent p \leftarrow \mathbb{P}(V_{i,j} = \textit{Empty}) \newline \indent \indent \textbf{if} \quad p > p_{\textit{max}} \newline \indent \indent \indent p_{\textit{max}} \leftarrow p \newline \indent \indent \indent (i,j)_{\textit{max}} \leftarrow (i,j) \newline \newline \indent \textbf{return} \quad (i,j)_{\textit{max}}$

##### Test Results

Since the mine distribution is uniform, the model should be equivalent to selecting locations at random. The expected result is that avoiding previously occupied grid locations is an ineffective strategy as the number of mines increases. This does however, provide an indication of what the success rate should look like for chance alone.

#### Conditional Model

##### Development

One improvement over the Marginal Model is to take into account the visual clues made visible when an empty grid location is exposed. Since an empty grid location represents the number of neighboring mines, the Conditional Model can look at these clues to determine whether or not an unexposed grid location contains a mine.

This boils down to determining the probability of $\mathbb{P}(\textit{Mine} \lvert \textrm{Evidence})$. A simplification in calculating the probability is to assume that each piece of evidence is independent. Under this assumption the result is a Naïve Bayes Classifier:

$\mathbb{P}( C = c \vert X = x ) = \dfrac{\mathbb{P}(C = c) \prod \mathbb{P}( X_{i} = x_{i} \vert C = c)}{\prod\mathbb{P}(X_{i} = x_{i})}$

As in the case of the Marginal Model, the Conditional Model returns the grid location that it has determined has the greatest probability of being empty given its neighbors:

$\underset{(i,j)}{\arg\max} \quad \mathbb{P}(C_{i,j} = \textit{Empty} | N(V_{i,j}) )$

$\textbf{CONDITIONAL-MODEL}(V, r) \newline \indent C \leftarrow \lbrace \textit{Empty}, \textit{Mine} \rbrace \newline \indent S \leftarrow \textbf{U-LOCATIONS}(V) \newline \newline \indent T \leftarrow \emptyset \newline \indent \textbf{foreach} \quad (i,j) \in S \newline \indent \indent N \leftarrow \textbf{NEIGHBORS}(V, i, j, r) \newline \indent \indent p_{\textit{max}} \leftarrow \perp \newline \indent \indent c_{\textit{max}} \leftarrow \perp \newline \indent \indent \textbf{foreach} \quad c \in C \newline \indent \indent \indent p \leftarrow \mathbb{P}(C = c) \newline \indent \indent \indent \textbf{foreach} \quad (u,v) \in N \newline \indent \indent \indent \indent p \leftarrow p * \mathbb{P}( X_{i, j} = V_{i, j} \vert C = c ) \newline \indent \indent \indent \textbf{if} \quad p > p_{\textit{max}} \newline \indent \indent \indent \indent p_{\textit{max}} \leftarrow p \newline \indent \indent \indent \indent c_{\textit{max}} \leftarrow c \newline \indent \indent \textbf{if} \quad c_{\textit{max}} = \textit {Empty} \newline \indent \indent \indent T \leftarrow T \cup (i, j) \newline \newline \indent \textbf{return} \quad \textbf{RANDOM}(T)$
##### Test Results

The Naïve Bayes Classifier is regarded as being an effective approach to classifying situations for a number of different tasks. In this case, it doesn’t look like it is effective at classifying mines from non-mines. The results are only slightly better than the Marginal Model.

#### Graphical Model

##### Development

One shortfall of the Conditional Model is that it takes a greedy approach in determining which action to take. A more sophisticated approach is to not just consider the next action, but the possible sequence of actions that will minimize the possibility of exposing a mine.

Each of the possible observable grids, $S$, can be thought of as a set of vertices in a graph whose corresponding set of edges represent the transition between a state, $s$, to the next observable state, $s^{\prime}$. Each transition was achieved by performing an action, $a \in A$, on the state. The specific action, $\alpha : S \times S \to A$, is chosen from a subset of permitted actions given the state. Each transition has a probability, $\mathbb{P}(s^{\prime} \vert s)$, of taking place.

It is possible to pick a path, $\pi$, through this graph that minimizes the risk by assigning a reward, $\rho : S \to \mathbb{R}$, to each state and attempting to identify an optimal path, $\pi^{*}_{s}$, from the present state that yields the greatest aggregate reward,

$\displaystyle \varrho(\pi) = \sum_{(s, s^{\prime}) \in \pi} \rho(s^{\prime}) \mathbb{P}(s^{\prime} \vert s)$

Solving for $\pi^{*}_{s}$ is equivalent to solving the Longest Path Problem and can be computed efficiently using a dynamic programming solution.

$\displaystyle \pi_{s}^{*} \gets \underset{\pi_{s}}{\arg\max} \ \sum_{(\sigma, \sigma^{\prime}) \in \pi_{s}} \rho(\sigma^{\prime}) \mathbb{P}(\sigma^{\prime} \vert \sigma) \ \pi_{s} \in \Pi_{s}$

$\textbf{GRAPHICAL-MODEL}(V) \newline \indent (a, r)_{\textit{max}} \gets (\bot, \bot) \newline \indent T \gets \emptyset \newline \newline \indent \textbf{foreach} \quad U \in \textbf{SUB-GRIDS}(V) \newline \indent \indent (a, r)_{U} \gets \textbf{OPTIMAL-ACTION}(U, \bot) \newline \indent \indent \textbf{if} \quad r_{U} > r_{\textit{max}} \newline \indent \indent \indent (a,r)_{\textit{max}} \gets (a,r)_{U} \newline \indent \indent \indent T \gets \emptyset \newline \indent \indent \textbf{if} \quad r_{U} = r_{\textit{max}} \newline \indent \indent \indent T \gets T \cup (a,r)_{U} \newline \newline \indent (a, r)_{\textit{max}} \gets \textbf{RANDOM}(T) \newline \newline \indent \textbf{return} \quad a_{\textit{max}} \newline \newline \textbf{OPTIMAL-ACTION}(V) \newline \indent (a, r)_{\textit{max}} \gets (\bot, \bot) \newline \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{U-LOCATIONS}(V) \newline \indent \indent \textbf{foreach} \quad V^{\prime} \in \textbf{OUTCOMES}(V, (i,j)) \newline \indent \indent \indent (a, r)_{V^{\prime}} \gets \textbf{OPTIMAL-ACTION}(V^{\prime}) \newline \indent \indent \indent r \gets r_{V^{\prime}} + \mathbb{P}(V^{\prime} \vert V) * \textbf{REWARD}(V^{\prime}) \newline \indent \indent \indent \textbf{if} \quad r \ge r_{\textit{max}} \newline \indent \indent \indent \indent (a, r)_{\textit{max}} \gets ((i,j), r) \newline \newline \indent \textbf{return} \quad (a,r)_{\textit{max}} \newline \newline \textbf{REWARD}(V) \newline \indent \textbf{if} \quad \textbf{SUCCESS}(V) \newline \indent \indent \indent \textbf{return} \quad +100 \newline \indent \textbf{if} \quad \textbf{FAILURE}(V) \newline \indent \indent \indent \textbf{return} \quad -100 \newline \newline \indent \textbf{return} +1$

From the optimal walk, a sequence of optimal actions is determined by mapping $\alpha$ over the path. Taking the first action gives the optimal grid location to expose given the current visible state of the board.

This description constitutes a Markov Decision Process. As is the case for most stochastic processes, it is assumed that the process holds the Markov Property; that future states only depend upon the current states and none of the prior states. In addition to being a Markov Decision Process, this is also an example of Reinforcement Learning.

First thing to observe is that the game state space is astronomical. For a standard beginner’s grid there is at most a sesvigintillion $(10^{81})$ possible grids that a player can encounter. Which as an aside, is on the order of the number of atoms in the observable universe! The set of actions at each state is slightly more manageable with at most eighty-one actions.

To simplify the state space, I chose to only consider $3 \times 3$ boards and when evaluating a full grid, consider the possible sub-grids and evaluate the optimal sequence of actions for each sub-grid and pick the maximum reward associated for each sub-grid that was evaluated as the action to take on the full grid.

##### Test Results

The Graphical Model produces results that are only a margin better than those of the Conditional Model.

#### Semi-deterministic Model

##### Development

The last model I’m going to talk about is a semi-deterministic model. It works by using the visible grid to infer the topology of the hidden grid and from the hidden grid, the topology that the visible grid can become.

The grid can be viewed as a graph. Each grid location is a vertex and an edge is an unexposed grid location’s influence on another grid location’s neighbor mine count.

For each of the exposed grid locations on the board, $v_{i,j}$, it’s neighbors, $N(v_{i,j})$, are all mines when the number of inbound edges $E_{i,j} = d(v_{i,j})$, matches the visible mine count $V_{i,j}$.

The model produces its inferred version, $F$, of the influence graph $E$ by using the determined mine locations $M$.

For each of the grid locations that are exposed and the inferred influence matches the visible count, then each of the neighbors about that location can be exposed provided they are not already exposed and not an inferred mine.

From this set of possibilities, a mine location is chosen. When no mine locations can be determined, then an alternative model can be used.

$\textbf{SEMI-DETERMINISTIC-MODEL}(V) \newline \indent E \leftarrow 0_{n, m} \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V) \newline \indent \indent \textbf{if} \quad V_{i,j} \ne \textit{U} \newline \indent \indent \indent \textbf{continue} \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j) \newline \indent \indent \indent E_{u,v} \leftarrow E_{u,v} + 1 \newline \newline \indent M \leftarrow \textit{False}_{n,m} \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V) \newline \indent \indent \textbf{if} \quad V_{i,j} \in \lbrace \textit{U}, \textit{0} \rbrace \lor V_{i,j} \neq E_{i,j} \newline \indent \indent \indent \textbf{continue} \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j) \newline \indent \indent \indent \textbf{if} \quad V_{i,j} = \textit{U} \newline \indent \indent \indent \indent M_{u,v} \leftarrow \textit{True} \newline \newline \indent F \leftarrow 0_{n,m} \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V) \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j) \newline \indent \indent \indent \textbf{if} \quad M_{u,v} \newline \indent \indent \indent \indent F_{i,j} \leftarrow F_{i,j} + 1 \newline \newline \indent S \leftarrow \emptyset \newline \indent \textbf{foreach} \quad (i,j) \in \textbf{LOCATIONS}(V) \newline \indent \indent \textbf{if} \quad V_{i,j} = \textit{U} \lor F_{i,j} \ne V_{i,j} \newline \indent \indent \indent \textbf{continue} \newline \indent \indent \textbf{foreach} \quad (u,v) \in \textbf{NEIGHBORS}(V, i, j) \newline \indent \indent \indent \textbf{if} \quad V_{i,j} \ne \textit{U} \lor M_{u,v} \newline \indent \indent \indent \indent \textbf{continue} \newline \indent \indent \indent S \leftarrow S \cup (u, v) \newline \newline \indent \textbf{return} \quad \textbf{RANDOM}(S)$

##### Test Results

Since the model is a more direct attempt at solving the board, its results are superior to the previously presented models. As the number of mines increases, it is more likely that it has to rely on a more probabilistic approach.

### Summary

Each of the models evaluated offered incremental improvements over their predecessors. Randomly selecting locations to expose is on par with choosing a location based on previously observed mine locations. The Conditional Model and Graphical Model yield similar results since they both make decisions based on conditioned probabilities. The Semi-deterministic Model stands alone as the only one model that produced reliable results.

The success rate point improvement between the Condition and Marginal models is most notable for boards consisting of three mines and the improvement between Graphical and Semi-deterministic models for seven mines. Improvements between Random and Marginal models is negligible and between Conditional and Graphical is minor for all mine counts fewer than seven.

Given the mathematical complexity and nondeterministic nature of the machine learning approaches, (in addition the the complexity and time involved in implementing those approaches) they don’t seem justified when more deterministic and simpler approaches exist. In particular, it seems like most people have implemented their agents using heuristics and algorithms designed to solve constraint satisfaction problems. Nonetheless, this was a good refresher to some of the elementary aspects of probability, statistics and machine learning.

### References

Classification – Naïve Bayes.” Data Mining Algorithms in R. Wikibooks. 3 Nov. 2010. Web. 30 Oct. 2011.

Windows Minesweeper.” MinesweeperWiki. 8 Sept. 2011. Web. 30 Oct. 2011.

Kaye, Richard. “Minesweeper Is NP-complete.” [pdf] Mathematical Intelligencer 22.2 (2000): 9-15. Web. 30 Oct. 2011.

Nakov, Preslav, and Zile Wei. “MINESWEEPER, #MINESWEEPER.” 14 May 2003. Web. 14 Apr. 2012.

Richard, Sutton, and Andrew G. Barto. “3.6 Markov Decision Processes.” Reinforcement Learning: An Introduction. Cambridge, Massachusetts: Bradford Book, 1998. 4 Jan. 2005. Web. 30 Oct. 2011.

Rish, Irene “An Empirical Study of the Naive Bayes Classifer.” [pdf] IJCAI-01 Workshop on Empirical Methods in AI (2001). Web. 30 Oct. 2011.

Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall/PearsonEducation., 2003. Print.

Sun, Yijun, and Jian Li. “Adaptive Learning Approach to Landmine Detection.” [pdf] IEEE Transactions of Aerospace and Electronic Systems 41.3 (2005): 1-9. 10 Jan. 2006. Web. 30 Oct. 2011.

Taylor, John R. An introduction to error analysis: the study of uncertainties in physical measurements. Sausalito, CA: University Science Books, 1997. Print.

Written by lewellen

2012-05-01 at 8:00 am

## Abstract Algebra in C#

### Motivation

In C++ it is easy to define arbitrary template methods for computations involving primitive numeric types because all types inherently have arithmetic operations defined. Thus, a programmer need only implement one method for all numeric types. The compiler will infer the use and substitute the type at compile time and emit the appropriate machine instructions. This is C++’s approach to parametric polymorphism.

With the release of C# 2.0 in the fall of 2005, the language finally got a taste of parametric polymorphism in the form of generics. Unfortunately, types in C# do not inherently have arithmetic operations defined, so methods involving computations must use ad-hoc polymorphism to achieve the same result as in C++. The consequence is a greater bloat in code and an increased maintenance liability.

To get around this design limitation, I’ve decided to leverage C#’s approach to subtype polymorphism and to draw from Abstract Algebra to implement a collection of interfaces allowing for C++ like template functionality in C#. The following is an overview of the mathematical theory used to support and guide the design of my solution. In addition, I will present example problems from mathematics and computer science that can be represented in this solution along with examples how type agnostic computations that can be performed using this solution.

### Abstract Algebra

Abstract Algebra is focused on how different algebraic structures behave in the presence of different axioms, operations and sets. In the following three sections, I will go over the fundamental sub-fields and how they are represented under the solution.

In all three sections, I will represent the distinction between algebraic structures using C# interfaces. The type parameters on these interfaces represent the sets being acted upon by each algebraic structure. This convention is consistent with intuitionistic (i.e., Chruch-style) type theory embraced by C#’s Common Type System (CTS). Use of parameter constraints will be used when type parameters are intended to be of a specific type. Functions on the set and elements of the set will be represented by methods and properties respectively.

#### Group Theory

Group Theory is the simplest of sub-fields of Abstract Algebra dealing with the study of a single binary operation, $(\cdot)$, acting on a set $a, b, c \in S$. There are five axioms used to describe the structures studied under Group Theory:

1. Closure: $(\cdot) : S \times S \to S$
2. Associativity: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$
3. Commutativity : $a \cdot b = b \cdot a$
4. Identity: $a \cdot e = e \cdot a$
5. Inverse: $a \cdot b = e$

The simplest of these structures is the Groupoid satisfying only axiom (1). Any Groupoid also satisfying axiom (2) is known as a Semi-group. Any Semi-group satisfying axiom (4) is a Monoid. Monoid’s also satisfying axiom (5) are known as Groups. Any Group satisfying axiom (3) is an Abelian Group.

public interface IGroupoid<T> {
T Operation(T a, T b);
}

public interface ISemigroup<T> : IGroupoid<T> {

}

public interface IMonoid<T> : ISemigroup<T> {
T Identity { get; }
}

public interface IGroup<T> : IMonoid<T> {
T Inverse(T t);
}

public interface IAbelianGroup<T> : IGroup<T> {

}


#### Ring Theory

The next logical sub-field of Abstract Algebra to study is Ring Theory which is the study of two operations, $(\cdot)$ and $(+)$, on a single set. In addition to the axioms outlined above, there is an addition axiom for describing how one operations distributes over the other.

1. Distributivity: $a \cdot (b + c) = (a \cdot b) + (a \cdot c), (a + b) \cdot c = (a \cdot c) + (b \cdot c)$

All of the following ring structures satisfy axiom (6). Rings are distinguished by the properties of their operands. The simplest of these structures is the Ringoid where both operands are given by Groupoids. Any Ringoid whose operands are Semi-groups is a Semi-ring. Any Semi-ring whose first operand is a Group is a Ring. Any Ring whose second operand is a Monoid is a Ring with Unity. Any Ring with Unity whose second operand is a Group is Division Ring. Any Division Ring whose operands are both Abelian Groups is a Field.

public interface IRingoid<T, A, M>
where A : IGroupoid<T>
where M : IGroupoid<T> {
M Multiplication { get; }

T Distribute(T a, T b);
}

public interface ISemiring<T, A, M> : IRingoid<T, A, M>
where A : ISemigroup<T>
where M : ISemigroup<T> {

}

public interface IRing<T, A, M> : ISemiring<T, A, M>
where A : IGroup<T>
where M : ISemigroup<T> {

}

public interface IRingWithUnity<T, A, M> : IRing<T, A, M>
where A : IGroup<T>
where M : IMonoid<T> {

}

public interface IDivisionRing<T, A, M> : IRingWithUnity<T, A, M>
where A : IGroup<T>
where M : IGroup<T> {

}

public interface IField<T, A, M> : IDivisionRing<T, A, M>
where A : IAbelianGroup<T>
where M : IAbelianGroup<T> {

}


#### Module Theory

The last, and more familiar, sub-field of Abstract Algebra is Module Theory which deals with structures with an operation, $(\circ) : S \times R \to R$, over two separate sets: $a,b \in S$ and $x,y \in R$ that satisfy the following axioms.

1. Distributivity of $S$: $a \circ (x + y) = (a \circ x) + (a \circ y)$
2. Distributivity of $R$: $(a + b) \circ x = (a \circ x) + (b \circ x)$
3. Associativity of $S$: $a \circ (b \circ x) = (a \cdot b) \circ x$

All of the following module structures satisfy axioms (7)-(9). A Module consists of a scalar Ring and an vector Abelian Group. Any Module whose Ring is a Ring with Unity is a Unitary Module. Any Unitary Module whose Ring with Unity is a Abelian Group is a Vector Space.

public interface IModule<
TScalar,
TVector,
TScalarRing,
TScalarMultiplicativeSemigroup,
>
where TScalarRing : IRing<TScalar, TScalarAddativeGroup, TScalarMultiplicativeSemigroup>
where TScalarMultiplicativeSemigroup : ISemigroup<TScalar>
{

TScalarRing Scalar { get; }

TVector Distribute(TScalar t, TVector r);
}

public interface IUnitaryModule<
TScalar,
TVector,
TScalarRingWithUnity,
TScalarMultiplicativeMonoid,
>
: IModule<
TScalar,
TVector,
TScalarRingWithUnity,
TScalarMultiplicativeMonoid,
>
where TScalarRingWithUnity : IRingWithUnity<TScalar, TScalarAddativeGroup, TScalarMultiplicativeMonoid>
where TScalarMultiplicativeMonoid : IMonoid<TScalar>
{

}

public interface IVectorSpace<
TScalar,
TVector,
TScalarField,
TScalarMultiplicativeAbelianGroup,
>
: IUnitaryModule<
TScalar,
TVector,
TScalarField,
TScalarMultiplicativeAbelianGroup,
>
where TScalarField : IField<TScalar, TScalarAddativeAbelianGroup, TScalarMultiplicativeAbelianGroup>
where TScalarMultiplicativeAbelianGroup : IAbelianGroup<TScalar>
{

}


### Representation of Value Types

The CTS allows for both value and reference types on the .NET Common Language Infrastructure (CLI). The following are examples of how each theory presented above can leverage value types found in the C# language to represent concepts drawn from mathematics.

#### Enum Value Types and the Dihedral Group $D_8$

One of the simplest finite groups is the Dihedral Group of order eight, $D_{8}$, representing the different orientations of a square, $e$, obtained by reflecting the square about the vertical axis, $b$, and rotating the square by ninety degrees, $a$. The generating set is given by $\lbrace a, b \rbrace$ and gives rise to the set $\lbrace e, a, a^2, a^3, b, ba, ba^2, ba^3 \rbrace$ These elements are assigned names as follows: $\text{Rot}(0) = e$, $\text{Rot}(90) = a$, $\text{Rot}(180) = a^2$, $\text{Rot}(270) = a^3$, $\text{Ref}(\text{Ver}) = b$, $\text{Ref}(\text{Desc}) = ba$, $\text{Ref}(\text{Hoz}) = ba^2$ and $\text{Ref}(\text{Asc}) = ba^3$. The relationship between these elements is visualized below:

The easiest way to represent this group as a value type is with an enum.

enum Symmetry { Rot000, Rot090, Rot180, Rot270, RefVer, RefDes, RefHoz, RefAsc }


From this enum we can define the basic Group Theory algebraic structures to take us to $D_8$.

public class SymmetryGroupoid : IGroupoid<Symmetry> {
public Symmetry Operation(Symmetry a, Symmetry b) {
// 64 cases
}
}

public class SymmetrySemigroup : SymmetryGroupoid, ISemigroup<Symmetry> {

}

public class SymmetryMonoid : SymmetrySemigroup, IMonoid<Symmetry> {
public Symmetry Identity {
get { return Symmetry.Rot000; }
}
}

public class SymmetryGroup : SymmetryMonoid, IGroup<Symmetry> {
public Symmetry Inverse(Symmetry a) {
switch (a) {
case Symmetry.Rot000:
return Symmetry.Rot000;
case Symmetry.Rot090:
return Symmetry.Rot270;
case Symmetry.Rot180:
return Symmetry.Rot270;
case Symmetry.Rot270:
return Symmetry.Rot090;
case Symmetry.RefVer:
return Symmetry.RefVer;
case Symmetry.RefDes:
return Symmetry.RefAsc;
case Symmetry.RefHoz:
return Symmetry.RefHoz;
case Symmetry.RefAsc:
return Symmetry.RefDes;
}

throw new NotImplementedException();
}

}


#### Integral Value Types and the Commutative Ring with Unity over $\mathbb{Z} / 2^n \mathbb{Z}$

C# exposes a number of fixed bit integral value types that allow a programmer to pick an integral value type suitable for the scenario at hand. Operations over these integral value types form a commutative ring with unity whose set is the congruence class $\mathbb{Z} / 2^n \mathbb{Z} = \lbrace \overline{0}, \overline{1}, \ldots, \overline{2^n-1} \rbrace$ where $n$ is the number of bits used to represent the integer and $\overline{m}$ is the equivalance class $\overline{m} = \lbrace m + k \cdot 2^n\rbrace$ with $k \in \mathbb{Z}$.

Addition is given by $\overline{a} + \overline{b} = \overline{(a + b)}$ just as multiplication is given by $\overline{a} \cdot \overline{b} = \overline{(a \cdot b)}$. Both statements are equivalent to the following congruence statements: $(a + b) \equiv c \pmod{2^n}$ and $(a \cdot b) \equiv c \pmod{2^n}$ respectively.

Under the binary numeral system, modulo $2^n$ is equivalent to ignoring the bits exceeding $n-1$, or equivalently, $\displaystyle \sum_{i = 0}^{\infty} c_i 2^i \equiv \sum_{i = 0}^{n-1} c_i 2^i \pmod{2^n}$ where $c \in \lbrace 0, 1 \rbrace$. As a result only the first (right most) bits need to be considered when computing the sum or product of two congruence classes, or in this case, integer values in C#. Thus, in the following implementation, it is not necessary to write any extra code to represent these operations other than writing them in their native form.

The reason why we are limited to a commutative ring with unity instead of a full field is that multiplicative inverses do not exist for all elements. A multiplicative inverse only exists when $ax \equiv 1 \pmod{2^n}$ where $x$ is the multiplicative inverse of $a$. For a solution to exist, $\gcd(a, 2^n) = 1$. Immediately, any even value of $a$ will not have a multiplicative inverse in $\mathbb{Z} \ 2^n \mathbb{Z}$. However, all odd numbers will.

public class AddativeIntegerGroupoid : IGroupoid<long> {
public long Operation(long a, long b) {
return a + b;
}
}

}

public long Identity {
get { return 0L; }
}
}

public long Inverse(long a) {
return -a;
}
}

}

public class MultiplicativeIntegerGroupoid : IGroupoid<long> {
public long Operation(long a, long b) {
return a * b;
}
}

public class MultiplicativeIntegerSemigroup : MultiplicativeIntegerGroupoid, ISemigroup<long> {

}

public class MultiplicativeIntegerMonoid : MultiplicativeIntegerSemigroup, IMonoid<long> {
public long Identity {
get { return 1L; }
}
}

public class IntegerRingoid : IRingoid<long, AddativeIntegerGroupoid, MultiplicativeIntegerGroupoid> {
public MultiplicativeIntegerGroupoid Multiplication { get; private set;}

public IntegerRingoid() {
Multiplication = new MultiplicativeIntegerGroupoid();
}

public long Distribute(long a, long b) {
return Multiplication.Operation(a, b);
}
}

public class IntegerSemiring : IntegerRingoid, ISemiring<long, AddativeIntegerSemiring, MultiplicativeIntegerSemiring> {
public MultiplicativeIntegerSemiring Multiplication { get; private set;}

public IntegerSemiring() : base() {
Multiplication = new MultiplicativeIntegerSemiring();
}
}

public class IntegerRing : IntegerSemiring, IRing<long, AddativeIntegerGroup, MultiplicativeIntegerSemigroup>{

public IntegerRing() : base() {
}
}

public class IntegerRingWithUnity : IntegerRing, IRingWithUnity<long, AddativeIntegerGroup, MultiplicativeIntegerMonoid> {
public MultiplicativeIntegerMonoid Multiplication { get; private set; }

public IntegerRingWithUnity() : base() {
Multiplication = new MultiplicativeIntegerMonoid();
}
}


#### Floating-point Value Types and the Real Vector Space $\mathbb{R}^n$

C# offers three types that approximate the set of Reals: floats, doubles and decimals. Floats are the least representative followed by doubles and decimals. These types are obviously not continuous, but the error involved in rounding calculations with respect to the calculations in question are negligible and for most intensive purposes can be treated as continuous.

As in the previous discussion on the integers, additive and multiplicative classes are defined over the algebraic structures defined in the Group and Ring Theory sections presented above. In addition to these implementations, an additional class is defined to describe a vector.

public class Vector<T> {
private T[] vector;

public int Dimension {
get { return vector.Length; }
}

public T this[int n] {
get { return vector[n]; }
set { vector[n] = value; }
}

public Vector() {
vector = new T[2];
}
}


With these classes, it is now possible to implement the algebraic structures presented in the Module Theory section from above.

public class RealVectorModule : IModule<double, Vector<double>, RealRing, AddativeRealGroup, MultiplicativeRealSemigroup, VectorAbelianGroup<double>> {
public RealRing Scalar {
get;
private set;
}

public VectorAbelianGroup<double> Vector {
get;
private set;
}

public RealVectorModule() {
Scalar = new RealRing();
}

public Vector<double> Distribute(double t, Vector<double> r) {
Vector<double> c = new Vector<double>();
for (int i = 0; i < c.Dimension; i++)
c[i] = Scalar.Multiplication.Operation(t, r[i]);
return c;
}
}

public class RealVectorUnitaryModule : RealVectorModule, IUnitaryModule<double, Vector<double>, RealRingWithUnity, AddativeRealGroup, MultiplicativeRealMonoid, VectorAbelianGroup<double>> {
public new RealRingWithUnity Scalar {
get;
private set;
}

public RealVectorUnitaryModule()
: base() {
Scalar = new RealRingWithUnity();
}
}

public class RealVectorVectorSpace : RealVectorUnitaryModule, IVectorSpace<double, Vector<double>, RealField, AddativeRealAbelianGroup, MultiplicativeRealAbelianGroup, VectorAbelianGroup<double>> {
public new RealField Scalar {
get;
private set;
}

public RealVectorVectorSpace()
: base() {
Scalar = new RealField();
}
}


### Representation of Reference Types

The following are examples of how each theory presented above can leverage reference types found in the C# language to represent concepts drawn from computer science.

#### Strings, Computability and Monoids

Strings are the simplest of reference types in C#. From an algebraic structure point of view, the set of possible strings, $\Sigma^{*}$, generated by an alphabet, $\Sigma$, and paired with a concatenation operation, $(+)$, forms a monoid.

public class StringGroupoid : IGroupoid<string> {
public string Operation(String a, String b) {
return string.Format("{0}{1}", a, b);
}
}

public class StringSemigroup : StringGroupoid, ISemigroup<string> {

}

public class StringMonoid : StringSemigroup, IMonoid<string> {
public string Identity {
get { return string.Empty; }
}
}


Monoids over strings have a volley of applications in the theory of computation. Syntactic Monoids describe the smallest set that recognizes a formal language. Trace Monoids describe concurrent programming by allowing different characters of an alphabet to represent different types of locks and synchronization points, while the remaining characters represent processes.

#### Classes, Interfaces, Type Theory and Semi-rings

Consider the set of types $\mathcal{T}^{*}$, that are primitive and constructed in C#. The generating set of $\mathcal{T}^{*}$ is the set of primitive reference and value types, $\mathcal{T}$, consisting of the types discussed thus far. New types can be defined by defining classes and interfaces.

A simple operation $(\oplus)$ over $\mathcal{T}^{*}$ takes two types, $\alpha, \beta$, and yields a third type, $\gamma$, known as a sum type. In type theory, this means that an instance of $\gamma$ can be either an instance of $\alpha$ or $\beta$. A second operation $(\otimes)$ over $\mathcal{T}^{*}$ takes two types and yields a third type representing a tuple of the first two types. In other words, $\alpha \otimes \beta = (\alpha, \beta)$.

Both operations form a semi-group $(\mathcal{T}^{*}, \otimes)$ and $(\mathcal{T}^{*}, \otimes)$ and in conjunction the two form a semi-ring.

To implement this semi-ring is a little involved. The .NET library supports emitting dynamic type definitions at runtime. For sum types, this would lead to an inheritance view of the operation. Types $\alpha$ and $\beta$ would end up deriving from $\gamma$ which means that any sequence of sums would yield an inheritance tree. A product type would result in composition of types with projection operations, $\pi_{n} : \prod_{i = 0} \tau_{i} \to \tau_{n}$, to access and assign the $n\text{'th}$ element of the composite. Both type operation implementations are outside the scope of this write-up and I’ll likely revisit this topic in a future write-up.

#### Delegates and Process Algebras

The third type of reference type to mention is the delegate type which is C#’s approach to creating first-class functions. The simplest of delegates is the built-in Action delegate which represents a single procedure taking no inputs and returning no value.

Given actions $a, b \in \mathcal{A}$, we can define a possible execution operator, $(\Vert) : \mathcal{A} \times \mathcal{A} \to \mathcal {A}$, where either $a$ or $b$ is executed denoted as $a \lVert b$. The choice operation forms a commutative semigroup $(\mathcal{A}, \lVert)$ since operations are associative $a \lVert (b \lVert c) = (a \lVert b) \lVert c$ and the operation is commutative $a \lVert b = b \lVert a$.

A product operation, $(\to) : \mathcal{A} \times \mathcal{A} \to \mathcal {A}$, representing the sequential execution of $a$ and then $b$ is given by $a \to b$. The sequence operator forms a groupoid with unity since the operation is not associative $a \to (b \to c) \neq (a \to b) \to c$ and there is an identity action $e$ representing a void operation resulting in $e \to a = a$.

Both operations together form a ringoid, $(\mathcal{A}, \to, \lVert)$ since the sequence operation distributes over the choice operation $a \to (b \lVert c) = (a \to b) \lVert (a \to c)$. Meaning that $a$ takes place and then $b$ or $c$ takes places is equivalent to $a$ and then $b$ takes place or $a$ and then $c$ takes place.

public class SequenceGroupoidWithUnity<Action> : IGroupoid<Action> {
public Action Identity {
get { return () => {}; }
}

public Action Operation(Action a, Action b) {
return () => { a(); b(); }
}
}

public class ChoiceGroupoid<Action> : IGroupoid<Action> {
public Action Operation(Action a, Action b) {
if(DateTime.Now.Ticks % 2 == 0)
return a;
return b;
}
}


The process algebra an be extended further to describe parallel computations with an additional operation. The operations given thus far enable one to derive the possible execution paths in a process. This enables one to comprehensively test each execution path to achieve complete test coverage.

### Examples

The motivation of this work was to achieve C++’s approach to parametric polymorphism by utilizing C# subtype polymorphism to define the algebraic structure required by a method (akin to the built-in operations on types in C++). To illustrate how these interfaces are to be used, the following example extension methods operate over a collection of a given type and accept the minimal algebraic structure to complete the computation. The result is a single implementation of the calculation that one would expect in C++.

static public class GroupExtensions {
static public T Sum<T>(this IEnumerable<T> E, IMonoid<T> m) {
return E
.FoldL(m.Identity, m.Operation);
}
}

static public class RingoidExtensions {
static public T Count<T, R, A, M>(this IEnumerable<R> E, IRingWithUnity<T, A, M> r)
where A : IGroup<T>
where M : IMonoid<T> {

return E
.Map((x) => r.Multiplication.Identity)
}

static public T Mean<T, A, M>(this IEnumerable<T> E, IDivisionRing<T, A, M> r)
where A : IGroup<T>
where M : IGroup<T> {

return r.Multiplication.Operation(
r.Multiplication.Inverse(
E.Count(r)
),
);
}

static public T Variance<T, A, M>(this IEnumerable<T> E, IDivisionRing<T, A, M> r)
where A : IGroup<T>
where M : IGroup<T> {

T average = E.Mean(r);

return r.Multiplication.Operation(
r.Multiplication.Inverse(
E.Count(r)
),
E
.Map((x) => r.Multiplication.Operation(x, x) )
);
}
}

static public class ModuleExtensions {
static public TV Mean<TS, TV, TSR, TSRA, TSRM, TVA>(this IEnumerable<TV> E, IVectorField<TS, TV, TSR, TSRA, TSRM, TVA> m)
where TSR : IField<TS, TSRA, TSRM>
where TSRA : IAbelianGroup<TS>
where TSRM : IAbelianGroup<TS>
where TVA : IAbelianGroup<TV> {

return m.Distribute(
m.Scalar.Multiplication.Inverse(
E.Count(m.Scalar)
),
E.FoldL(
m.Vector.Identity,
m.Vector.Operation
)
);
}
}


### Conclusion

Abstract Algebra comes with a rich history and theory for dealing with different algebraic structures that are easily represented and used in the C# language to perform type agnostic computations. Several examples drawn from mathematics and computer science illustrated how the solution can be used for both value and reference types in C# and be leveraged in the context of a few example type agnostic computations. The main benefit of this approach is that it minimizes the repetitious coding of computations required under the ad-hoc polymorphism approach adopted by the designers of C# language. The downside is that several structures must be defined for the types being computed over and working with C# parameter constraint system can be unwieldy. While an interesting study, this solution would not be practical in a production setting under the current capabilities of the C# language.

### References

Baeten, J.C.M. A Brief History of Process Algebra [pdf]. Department of Computer Science, Technische Universiteit Eindhoven. 31 Mar. 2012.

ECMA International. Standard ECMA-335 Common Language Infrastructure [pdf]. 2006.

Fokkink, Wan. Introduction of Process Algebra [pdf]. 2nd ed. Berlin: Springer-Verlang, 2007. 10 Apr. 2007. 31 Mar. 2012.

Goodman, Joshua. Semiring Parsing [pdf]. Computational Linguistics 25 (1999): 573-605. Microsoft Research. 31 Mar. 2012.

Hungerford, Thomas. Algebra. New York: Holt, Rinehart and Winston, 1974.

Ireland, Kenneth. A classical introduction to modern number theory. New York: Springer-Verlag, 1990.

Litvinov, G. I., V. P. Maslov, and A. YA Rodionov. Universal Algorithms, Mathematics of Semirings and Parallel Computations [pdf]. Spring Lectures Notes in Computational Science and Engineering. 7 May 2010. 31 Mar. 2012 .

Mazurkiewicz, Antoni. Introduction to Trace Theory [pdf]. Rep. 19 Nov. 1996. Institute of Computer Science, Polish Academy of Sciences. 31 Mar. 2012.

Pierce, Benjamin. Types and programming languages. Cambridge, Mass: MIT Press, 2002.