Archive for the ‘Probability and Statistics’ Category
GPU Accelerated Expectation Maximization for Gaussian Mixture Models using CUDA
C, CUDA, and Python source code available on GitHub
Introduction
Gaussian Mixture Models [1, 435–439] offer a simple way to capture complex densities by employing a linear combination of $K$ multivariate normal distributions, each with its own mean, $\mu_k$, covariance, $\Sigma_k$, and mixture coefficient, $\pi_k$, s.t. $\sum_{k=1}^{K} \pi_k = 1$:

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
Of practical interest is the learning of the number of components and the values of the parameters. Model selection criteria, such as the Akaike and Bayesian information criteria, can be used to identify the number of components, or nonparametric models like Dirichlet processes can be used to avoid the matter altogether. We won’t cover these techniques here, but will instead focus on finding the values of the parameters given sufficient training data using the Expectation-Maximization algorithm [3], and doing so efficiently on the GPU. Technical considerations will be discussed, and the work will conclude with an empirical evaluation of sequential and parallel implementations for the CPU, and a massively parallel implementation for the GPU, for varying numbers of components, points, and point dimensions.
Multivariate Normal Distribution
The multivariate normal distribution with mean, $\mu$, and symmetric, positive definite covariance, $\Sigma$, is given by:

$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
From a computational perspective, we will be interested in evaluating the density for $N$ values. Thus, a naive implementation would be bounded by $O(d^3)$ due to the matrix determinant in the normalization term. We can improve upon this by computing the Cholesky factorization, $\Sigma = L L^T$, where $L$ is a lower triangular matrix [6, 157–158]. The factorization requires $O(d^3)$ time and computing the determinant becomes $O(d)$ by taking advantage of the fact that $\lvert \Sigma \rvert = \lvert L \rvert \lvert L^T \rvert = \left( \prod_i L_{ii} \right)^2$. Further, we can precompute the factorization and normalization factor for a given parameterization, which leaves us with the complexity of the Mahalanobis distance given by the quadratic form in the exponential. Naive computation requires one perform two vector-matrix operations and find the inverse of the covariance matrix with worst case behavior $O(d^3)$. Leveraging the Cholesky factorization, we’ll instead solve the triangular system $L z = x - \mu$ by forward substitution in $O(d^2)$ and complete an inner product in $O(d)$, since $(x - \mu)^T \Sigma^{-1} (x - \mu) = z^T z$. Thus, our pre-initialization time is $O(d^3)$ and density determination is $O(d^2)$. Further optimizations are possible by considering special diagonal cases of the covariance matrix, such as the isotropic, $\Sigma = \sigma^2 I$, and non-isotropic, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, configurations. For robustness, we’ll stick with the full covariance.
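The Cholesky-based evaluation described above can be sketched in NumPy (a sketch, not the project's C/CUDA source; the function name is mine):

```python
import numpy as np

def gaussian_logpdf(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X via the Cholesky factor."""
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T, O(d^3), done once
    # log|Sigma| = 2 * sum(log L_ii), so the normalizer needs only the diagonal
    log_norm = -0.5 * d * np.log(2.0 * np.pi) - np.sum(np.log(np.diag(L)))
    # Solve L z = (x - mu); the Mahalanobis term is then z^T z
    Z = np.linalg.solve(L, (X - mu).T)
    return log_norm - 0.5 * np.sum(Z * Z, axis=0)
```

The factor and normalizer can be cached per component, so each density evaluation costs only the $O(d^2)$ triangular solve.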
To avoid numerical issues such as overflow and underflow, we’re going to work with $\log \mathcal{N}(x \mid \mu, \Sigma)$ throughout the remainder of the work. For estimates of the covariance matrix, we will want more samples than the dimension of the data to avoid a singular covariance matrix [4]. Even with this criterion satisfied, it may still be possible to produce a singular matrix if some of the data are collinear and span a lower-dimensional subspace of $\mathbb{R}^d$.
Expectation Maximization
From an unsupervised learning point of view, GMMs can be seen as a generalization of k-means allowing for partial assignment of points to multiple classes. A possible classifier is given by $c(x) = \arg\max_k \, \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Alternatively, multiple components can be used to represent a single class and we argmax over the corresponding subset sums. The utility of GMMs goes beyond classification, and can be used for regression as well. The Expectation-Maximization (EM) algorithm will be used to find the parameters of the model by starting with an initial guess for the parameters given by uniform mixing coefficients, means determined by the k-means algorithm, and spherical covariances for each component. Then, the algorithm iteratively computes probabilities given a fixed set of parameters, then updates those parameters by maximizing the log-likelihood of the data:

$\mathcal{L} = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
Because we are dealing with exponents and logarithms, it’s very easy to end up with underflow and overflow situations, so we’ll continue the trend of working in log-space and also make use of the “log-sum-exp trick” to avoid these complications:

$\log \sum_i e^{a_i} = m + \log \sum_i e^{a_i - m}$
where the term $m = \max_i a_i$ is the maximum exponential argument within a stated sum. Within the expectation stage of the algorithm we will compute the posterior distributions of the components conditioned on the training data (we omit the mixing coefficient since it cancels out in the maximization steps of $\mu$ and $\Sigma$, and account for it explicitly in the update of $\pi$):

$\gamma_{nk} = \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
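The log-sum-exp trick described above is a one-liner worth stating explicitly (a sketch; the function name is mine):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a))): shift by the max before exponentiating."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))
```

A naive `np.log(np.sum(np.exp(a)))` overflows for arguments around 1000, while the shifted form returns the exact answer.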
The new parameters are resolved within the maximization step:

$\pi_k = \frac{N_k}{N}, \quad \mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T, \quad N_k = \sum_n \gamma_{nk}$
The algorithm continues back and forth between the expectation and maximization stages until the change in log-likelihood is less than some epsilon, or a maximum number of user-specified iterations has elapsed.
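The loop just described can be sketched compactly in NumPy. This is an illustrative sketch, not the released source: farthest-point seeding stands in for the k-means initialization mentioned above, and the small diagonal regularizer guarding against singular covariances is my own addition.

```python
import numpy as np

def mvn_logpdf(X, mu, Sigma):
    # Cholesky-based log-density, as in the earlier section
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    Z = np.linalg.solve(L, (X - mu).T)
    return (-0.5 * d * np.log(2 * np.pi) - np.sum(np.log(np.diag(L)))
            - 0.5 * np.sum(Z * Z, axis=0))

def em_gmm(X, K, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Farthest-point seeding (stand-in for k-means), uniform mixing, shared covariance
    mu = [X[rng.integers(N)]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - m) ** 2, axis=1) for m in mu], axis=0)
        mu.append(X[np.argmax(d2)])
    mu = np.array(mu)
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_pi = np.full(K, -np.log(K))
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: log responsibilities via the log-sum-exp trick
        log_p = np.stack([log_pi[k] + mvn_logpdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.sum(np.exp(log_p - m), axis=1, keepdims=True))
        ll = log_norm.sum()
        gamma = np.exp(log_p - log_norm)
        # M-step: reestimate mixing coefficients, means, and covariances
        Nk = gamma.sum(axis=0)
        log_pi = np.log(Nk / N)
        for k in range(K):
            mu[k] = gamma[:, k] @ X / Nk[k]
            D = X - mu[k]
            Sigma[k] = (gamma[:, k] * D.T) @ D / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:   # stop on small log-likelihood change
            break
        prev_ll = ll
    return np.exp(log_pi), mu, Sigma, ll
```

On two well-separated synthetic clusters this recovers the component means to within sampling error in a handful of iterations.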
Implementations
Sequential Per-iteration complexity is dominated by the $O(d^2)$ density evaluations over all points and components. We expect $d \ll N$ and $K \ll N$, because too many dimensions leads to a lot of dead space and too many components results in overfitting of the data. Thus, the dominating term for sequential execution is given by $O(N K d^2)$.
Parallel There are two natural data parallelisms that appear in the algorithm: the calculation of the responsibilities and the log-likelihood across points, while the probability densities and parameter updates have natural parallelisms across components. Each POSIX thread runs the full iterative algorithm with individual stages coordinated by barrier synchronization. Resulting complexity is given by $O(N K d^2 / P)$ for work coordinated across $P$ processors.
Massively Parallel The parallel implementation can be taken and mapped over to the GPU with parallelism taken across points and components depending on the terms being computed. There are several types of parallelism that we will leverage under the CUDA programming model. For the calculation of the component densities we compute each point in parallel by forming a grid of one-dimensional blocks, and use streams with event synchronization to carry out each component in parallel across the streaming multiprocessors. Calculation of the log-likelihood is done by computing and storing the per-point, per-component log-densities, updating that storage with the log-sum-exp over components, and then performing a parallel reduction over the points to produce the log-likelihood. Parallel reductions are a core task and are implemented by first padding the input array of points to the next power of two, then reducing each block using shared memory, and applying a linear map to the memory so that successive block reductions can be applied. Several additional approaches are discussed in [5]. Once the log-likelihood is computed, the streams are synchronized with the host and the result is copied from the device back to the host. To compute the responsibilities, the log-densities are copied to a working memory and a maximum parallel reduction is performed. The resulting maximum is used in a separate exponential map for numerical stability when computing the parallel reduction of each component. Updates to the mean and covariances are performed by mapping each term to a working memory allocated for each component’s stream and executing a parallel reduction to yield the updated mean and covariance. Once all component streams have been synchronized, the mixture coefficients and Cholesky decompositions of the covariances are computed with a single kernel invocation parallel in the number of components.
The main design consideration was whether or not to use streams. For larger numbers of components, streams improve runtime performance; however, they come at the cost of increased memory usage, which limits the size of problems an end user can study with the implementation. Because the primary design goal is performance, the increased memory usage was preferred over using less memory and executing each component sequentially.
To optimize the runtime of the implementation, nvprof along with the NVIDIA Visual Profiler was used to identify performance bottlenecks. The original implementation was a naive port of the parallel C code which required frequent memory transfers between host and device, resulting in significant CUDA API overhead that dominated the runtime. By transferring and allocating memory on the device beforehand, the implementation could execute primarily on the GPU and eliminate the API overhead. The second primary optimization was using streams and events to parallelize the component probability densities and parameter updates in the maximization step, allowing for a $K$-fold reduction since the component calculations would be performed in parallel. The next optimization step was to streamline the parallel reductions by using block reductions against fast shared block memory, minimizing the number of global memory writes, instead of performing iterated reductions against sequential addressing that performed global memory reads and writes for each point. The final optimization step was to use pinned host memory to enable zero-copy transfers from DRAM to the GPU over DMA.
Evaluation
To evaluate the implementations we need a way of generating GMMs and sampling data from the resulting distributions. To sample from a standard univariate normal distribution one can use the Box–Muller transform, Ziggurat method, or ratio-of-uniforms method [7]. The latter is used here due to its simplicity and efficiency. Sampling from the multivariate normal distribution can be done by sampling a standard normal vector, $z$, and computing $x = \mu + A z$, where $A$ can be computed by eigendecomposition or by Cholesky factorization, $\Sigma = L L^T$ with $A = L$. The latter is used since it is more efficient. The GMM describes a generative process whereby we pick a component at random with probability given by its mixture coefficient and then sample the underlying distribution, and perform this process for the desired number of points.
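The generative process above can be sketched in a few lines of NumPy (a sketch of the described process, not the benchmark harness; the function name is mine):

```python
import numpy as np

def sample_gmm(pis, mus, Sigmas, n, seed=0):
    """Draw n points: pick a component by its mixture coefficient,
    then transform a standard normal vector z via x = mu + L z, Sigma = L L^T."""
    rng = np.random.default_rng(seed)
    Ls = [np.linalg.cholesky(S) for S in Sigmas]
    ks = rng.choice(len(pis), size=n, p=pis)       # component choices
    Z = rng.standard_normal((n, mus.shape[1]))     # standard normal vectors
    return np.array([mus[k] + Ls[k] @ z for k, z in zip(ks, Z)])
```

With a single component, the sample mean and sample covariance converge to the component's parameters, which makes for a quick sanity check.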
The matter of generating GMMs is more interesting. Here we draw the mixture coefficients uniformly and normalize them to sum to one; alternatively, one could draw them from a Dirichlet distribution. Means are drawn from a normal distribution with a variance large enough that the means are relatively spread out in $\mathbb{R}^d$. The more exciting prospect is how to sample the covariance matrix. This is where the Wishart distribution comes in handy. The Wishart distribution is a model of what the sample covariance matrix should look like given a series of vectors. Based on a method by [8], [9] gives an equally efficient method for sampling: build a lower triangular matrix whose diagonal entries are square roots of chi-squared draws and whose subdiagonal entries are standard normal draws, then conjugate its outer product by the Cholesky factor of the desired scale matrix.
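The chi-squared/normal construction attributed to [8, 9] can be sketched as follows (a sketch; the function name and use of NumPy's generator are mine, and the degrees of freedom must exceed $d - 1$):

```python
import numpy as np

def sample_wishart(df, Sigma, rng):
    """Draw one matrix from a Wishart distribution with df degrees of freedom
    and scale Sigma, via a Bartlett-style lower triangular construction."""
    d = Sigma.shape[0]
    A = np.zeros((d, d))
    for i in range(d):
        A[i, i] = np.sqrt(rng.chisquare(df - i))   # chi-squared diagonal
        A[i, :i] = rng.standard_normal(i)          # standard normal subdiagonal
    L = np.linalg.cholesky(Sigma)                  # scale enters via Cholesky factor
    B = L @ A
    return B @ B.T
```

A useful check is that the mean of Wishart draws with $\nu$ degrees of freedom and scale $\Sigma$ is $\nu \Sigma$.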
To evaluate the performance of the different implementations, the wall clock time taken to run the algorithm on a synthetic instance was measured while varying each of the $N$, $K$, and $d$ parameters and holding the other two fixed. From an end user perspective, wall clock time is preferable to the CPU time the operating system actually devoted to the problem, since it reflects the delay the user actually experiences. There will be variability in the results since each instance requires a different number of iterations for the log-likelihood to converge. Tests were conducted on a Xeon 1245 v5 3.5 GHz system with 32 GB of memory and an NVIDIA GTX 1060 6 GB graphics card with 1280 cores.
Since the parameter space is relatively large, Figures 2–5 look at varying one parameter while fixing the others to demonstrate the relative merits of each approach. When the number of points dominates, the CUDA approach tends to be 18x faster; the Parallel approach tends to be 3x faster when the dimension is high; and CUDA is suitable when the number of components is high, giving a 20x improvement relative to the sequential approach. Thus, when dealing with suitably large datasets, the CUDA based implementation is preferable, delivering superior runtime performance without sacrificing quality.
It is important to note that the results obtained from the CUDA solution may differ from those of the sequential and parallel approaches. This is due to nondeterministic round-off errors associated with executing parallel reductions compared to sequential reductions [2], and differences in the handling of floating point values on the GPU [10], notably the presence of fused multiply-add on NVIDIA GPUs, which is more accurate than what is frequently implemented in CPU architectures. The following two synthetic data sets illustrate typical results of the three schemes:
Conclusion
This work demonstrated the utility of using NVIDIA GPUs to train Gaussian mixture models by the Expectation-Maximization algorithm. Speedups as high as 20x were observed on synthetic datasets by varying the number of points, components, and data dimension while leaving the others fixed. It is believed that further speedups should be possible with additional passes, and the inclusion of metric data structures to limit which data is considered during calculations. Future work would pursue more memory efficient solutions on the GPU to allow for larger problem instances, and focus on providing higher-level language bindings so that it can be better utilized in traditional data science toolchains.
References
 Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
 Collange, S., Defour, D., Graillat, S., and Iakymchuk, R. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49 (2015), 83–97.
 Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) (1977), 1–38.
 Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, 1 (2016), C1–C32.
 Harris, M. Optimizing parallel reduction in CUDA. SC07: High Performance Computing with CUDA (2007).
 Kincaid, D., and Cheney, W. Numerical analysis: mathematics of scientific computing. 3rd ed. Brooks/Cole, 2002.
 Kinderman, A. J., and Monahan, J. F. Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 257–260.
 Odell, P., and Feiveson, A. A numerical procedure to generate a sample covariance matrix. Journal of the American Statistical Association 61, 313 (1966), 199–203.
 Sawyer, S. Wishart distributions and inverse-Wishart sampling. URL: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf (2007).
 Whitehead, N., and Fit-Florea, A. Precision and performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. NVIDIA whitepaper (2011).
Expected Maximum and Minimum of Real-Valued Continuous Random Variables
Introduction
This is a quick paper exploring the expected maximum and minimum of real-valued continuous random variables for a project that I’m working on. This paper will be somewhat more formal than some of my previous writings, but should be an easy read, beginning with some required definitions, a problem statement, a general solution, and specific results for a small handful of continuous probability distributions.
Definitions
Definition (1) : Given the probability space, $(\Omega, \mathcal{F}, P)$, consisting of a set representing the sample space, $\Omega$, a $\sigma$-algebra, $\mathcal{F}$, and a Lebesgue measure, $P$, the following properties hold true:
 Nonnegativity: $P(A) \ge 0$ for all $A \in \mathcal{F}$
 Null empty set: $P(\emptyset) = 0$
 Countable additivity of disjoint sets: $P\left( \bigcup_i A_i \right) = \sum_i P(A_i)$ for pairwise disjoint $A_i$
Definition (2) : Given a real-valued continuous random variable, $X : \Omega \to \mathbb{R}$, the event that the random variable takes on a fixed value, $x$, is the event measured by the probability density function $f_X(x)$. Similarly, the event that the random variable takes on a range of values less than some fixed value, $x$, is the event measured by the cumulative distribution function $F_X(x)$. By definition, the following properties hold true:
 (i) $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$
 (ii) $f_X(x) \ge 0$
 (iii) $\int_{-\infty}^{\infty} f_X(t)\,dt = 1$
Definition (3) : Given a second real-valued continuous random variable, $Y$, the joint event will be measured by the joint probability density $f_{X,Y}(x, y)$. If $X$ and $Y$ are statistically independent, then $f_{X,Y}(x, y) = f_X(x) f_Y(y)$.
Definition (4) : Given a real-valued continuous random variable, $X$, the expected value is $\mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx$.
Definition (5) : (Law of the unconscious statistician) Given a real-valued continuous random variable, $X$, and a function, $g$, then $g(X)$ is also a real-valued continuous random variable and its expected value is $\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$, provided the integral converges. Given two real-valued continuous random variables, $X, Y$, and a function, $g$, then $g(X, Y)$ is also a real-valued continuous random variable and its expected value is $\mathbb{E}[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\,dx\,dy$. Under the independence assumption of Definition (3), the expected value becomes $\mathbb{E}[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_X(x) f_Y(y)\,dx\,dy$.
Remark (1) : For the remainder of this paper, all real-valued continuous random variables will be assumed to be independent.
Problem Statement
Theorem (1) : Given two real-valued continuous random variables, $X, Y$, the expected value of the minimum of the two variables is $\mathbb{E}[\min(X, Y)] = \mathbb{E}[X] + \mathbb{E}[Y] - \mathbb{E}[\max(X, Y)]$.
Lemma (1) : Given two real-valued continuous random variables, $X, Y$, the expected value of the maximum of the two variables is $\mathbb{E}[\max(X, Y)] = \int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y)\,dy$.
Proof of Lemma (1) :
$\mathbb{E}[\max(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max(x, y) f_X(x) f_Y(y)\,dy\,dx$ (Definition (5))
$= \int_{-\infty}^{\infty} \int_{-\infty}^{x} x f_X(x) f_Y(y)\,dy\,dx + \int_{-\infty}^{\infty} \int_{-\infty}^{y} y f_X(x) f_Y(y)\,dx\,dy$ (Definition (1.iii))
$= \int_{-\infty}^{\infty} x f_X(x) \int_{-\infty}^{x} f_Y(y)\,dy\,dx + \int_{-\infty}^{\infty} y f_Y(y) \int_{-\infty}^{y} f_X(x)\,dx\,dy$ (Fubini’s theorem)
$= \int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y)\,dy$ (Definition (2.i))
Proof of Theorem (1) :
$\mathbb{E}[\min(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \min(x, y) f_X(x) f_Y(y)\,dy\,dx$ (Definition (4))
$= \int_{-\infty}^{\infty} \int_{y}^{\infty} y f_X(x) f_Y(y)\,dx\,dy + \int_{-\infty}^{\infty} \int_{x}^{\infty} x f_X(x) f_Y(y)\,dy\,dx$ (Definition (1.iii))
$= \int_{-\infty}^{\infty} y f_Y(y) \int_{y}^{\infty} f_X(x)\,dx\,dy + \int_{-\infty}^{\infty} x f_X(x) \int_{x}^{\infty} f_Y(y)\,dy\,dx$ (Fubini’s theorem)
$= \int_{-\infty}^{\infty} y f_Y(y) \left( 1 - F_X(y) \right) dy + \int_{-\infty}^{\infty} x f_X(x) \left( 1 - F_Y(x) \right) dx$ (Definition (2.iii))
$= \mathbb{E}[Y] + \mathbb{E}[X] - \left( \int_{-\infty}^{\infty} y f_Y(y) F_X(y)\,dy + \int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx \right)$ (Definition (2.i))
$= \mathbb{E}[X] + \mathbb{E}[Y] - \mathbb{E}[\max(X, Y)]$ (Definition (4), Lemma (1))
Remark (2) : For real values $x, y$: $\max(x, y) + \min(x, y) = x + y$.
Proof of Remark (2) : If $x \le y$, then $\max(x, y) = y$, otherwise $\max(x, y) = x$. If $x \le y$, then $\min(x, y) = x$, otherwise $\min(x, y) = y$. In either case one of the two terms equals $x$ and the other equals $y$. Therefore, $\max(x, y) + \min(x, y) = x + y$.
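A related closed form, which follows from Remark (2), expresses both quantities through the absolute difference (the function name here is mine):

```python
def split_max_min(x, y):
    """max(x,y) = (x + y + |x - y|)/2 and min(x,y) = (x + y - |x - y|)/2."""
    return (x + y + abs(x - y)) / 2, (x + y - abs(x - y)) / 2
```

Both halves sum to $x + y$, which is exactly the identity Theorem (1) lifts to expectations.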
Worked Continuous Probability Distributions
The following section of this paper derives the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution, and continuous uniform distribution. The derivation of the expected value of the minimum of real-valued continuous random variables is omitted, as it can be found by applying Theorem (1).
Exponential Distribution
Definition (6) : Given a real-valued continuous exponentially distributed random variable, $X$, with rate parameter, $\lambda > 0$, the probability density function is $f_X(x) = \lambda e^{-\lambda x}$ for all $x \ge 0$ and zero everywhere else.
Corollary (6.i) : The cumulative distribution function of a real-valued continuous exponentially distributed random variable, $X$, is therefore $F_X(x) = 1 - e^{-\lambda x}$ for all $x \ge 0$ and zero everywhere else.
Proof of Corollary (6.i) : $F_X(x) = \int_{0}^{x} \lambda e^{-\lambda t}\,dt = \left[ -e^{-\lambda t} \right]_{0}^{x} = 1 - e^{-\lambda x}$.
Corollary (6.ii) : The expected value of a real-valued continuous exponentially distributed random variable, $X$, is therefore $\mathbb{E}[X] = \frac{1}{\lambda}$.
Proof of Corollary (6.ii) :
The expected value is $\mathbb{E}[X] = \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx$ by Definition (4), and by Lemma (2), $\mathbb{E}[X] = \lambda \cdot \frac{1}{\lambda^2} = \frac{1}{\lambda}$.
Lemma (2) : Given a real value $c > 0$, then $\int_{0}^{\infty} x e^{-c x}\,dx = \frac{1}{c^2}$.
Proof of Lemma (2) : Integrating by parts, $\int_{0}^{\infty} x e^{-c x}\,dx = \left[ -\frac{x}{c} e^{-c x} \right]_{0}^{\infty} + \frac{1}{c} \int_{0}^{\infty} e^{-c x}\,dx = 0 + \frac{1}{c} \cdot \frac{1}{c} = \frac{1}{c^2}$.
Theorem (2) : The expected value of the maximum of the real-valued continuous exponentially distributed random variables $X \sim \mathrm{Exp}(\lambda_X)$, $Y \sim \mathrm{Exp}(\lambda_Y)$, is $\mathbb{E}[\max(X, Y)] = \frac{1}{\lambda_X} + \frac{1}{\lambda_Y} - \frac{1}{\lambda_X + \lambda_Y}$.
Proof of Theorem (2) :
$\mathbb{E}[\max(X, Y)] = \int_{0}^{\infty} x \lambda_X e^{-\lambda_X x} F_Y(x)\,dx + \int_{0}^{\infty} y \lambda_Y e^{-\lambda_Y y} F_X(y)\,dy$ (Lemma (1))
$= \int_{0}^{\infty} x \lambda_X e^{-\lambda_X x} \left( 1 - e^{-\lambda_Y x} \right) dx + \int_{0}^{\infty} y \lambda_Y e^{-\lambda_Y y} \left( 1 - e^{-\lambda_X y} \right) dy$ (Corollary (6.i))
$= \int_{0}^{\infty} x \lambda_X e^{-\lambda_X x}\,dx + \int_{0}^{\infty} y \lambda_Y e^{-\lambda_Y y}\,dy - (\lambda_X + \lambda_Y) \int_{0}^{\infty} x e^{-(\lambda_X + \lambda_Y) x}\,dx$ (Integral linearity)
$= \frac{1}{\lambda_X} + \frac{1}{\lambda_Y} - \frac{\lambda_X + \lambda_Y}{(\lambda_X + \lambda_Y)^2} = \frac{1}{\lambda_X} + \frac{1}{\lambda_Y} - \frac{1}{\lambda_X + \lambda_Y}$ (Lemma (2), Corollary (6.ii))
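Theorem (2) is easy to check by Monte Carlo simulation (an illustrative check; the function name and rate values are mine):

```python
import numpy as np

def expected_max_exp(lx, ly):
    """Closed form from Theorem (2): 1/lx + 1/ly - 1/(lx + ly)."""
    return 1 / lx + 1 / ly - 1 / (lx + ly)

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(1 / 2.0, n)   # rate 2 (NumPy takes the scale, 1/rate)
y = rng.exponential(1 / 3.0, n)   # rate 3
mc = np.maximum(x, y).mean()      # empirical E[max(X, Y)]
```

For rates 2 and 3, the closed form gives $1/2 + 1/3 - 1/5 \approx 0.6333$, and the empirical mean agrees to within sampling error.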
Normal Distribution
Definition (7) : The following Gaussian integral is the error function, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\,dt$, for which the following properties hold true:
 Odd function: $\mathrm{erf}(-x) = -\mathrm{erf}(x)$
 Limiting behavior: $\lim_{x \to \pm\infty} \mathrm{erf}(x) = \pm 1$
Definition (8) : Given a real-valued continuous normally distributed random variable, $X$, with mean parameter, $\mu$, and standard deviation parameter, $\sigma > 0$, the probability density function is $f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$ for all values on the real line.
Corollary (8.i) : The cumulative distribution function of a real-valued continuous normally distributed random variable, $X$, is therefore $F_X(x) = \frac{1}{2} \left( 1 + \mathrm{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right)$.
Proof of Corollary (8.i) :
$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(t - \mu)^2}{2 \sigma^2}}\,dt$ (Definition (2.i))
$= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{x - \mu}{\sigma \sqrt{2}}} e^{-u^2}\,du$ (U-substitution with $u = \frac{t - \mu}{\sigma \sqrt{2}}$)
$= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{0} e^{-u^2}\,du + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{x - \mu}{\sigma \sqrt{2}}} e^{-u^2}\,du$ (Definition (2.iii))
$= -\frac{1}{\sqrt{\pi}} \int_{0}^{-\infty} e^{-u^2}\,du + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{x - \mu}{\sigma \sqrt{2}}} e^{-u^2}\,du$ (Reverse limits of integration)
$= -\frac{1}{2} \mathrm{erf}(-\infty) + \frac{1}{2} \mathrm{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right)$ (Definition (7))
$= \frac{1}{2} \mathrm{erf}(\infty) + \frac{1}{2} \mathrm{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right)$ (Definition (7.i))
$= \frac{1}{2} \left( 1 + \mathrm{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right)$ (Definition (7.ii))
Corollary (8.ii) : The expected value of a real-valued continuous normally distributed random variable, $X$, is therefore $\mathbb{E}[X] = \mu$.
Proof of Corollary (8.ii) :
$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}\,dx$ (Definition (4))
$= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \left( \mu + \sigma \sqrt{2}\, u \right) e^{-u^2}\,du$ (U-substitution with $u = \frac{x - \mu}{\sigma \sqrt{2}}$)
$= \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-u^2}\,du + \frac{\sigma \sqrt{2}}{\sqrt{\pi}} \int_{-\infty}^{\infty} u e^{-u^2}\,du$ (Integral linearity)
$= \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-u^2}\,du + 0$ ($u$ is odd, $e^{-u^2}$ is even)
$= \mu$ (Definition (7), Definition (7.ii))
Definition (9) : Given a real-valued continuous normally distributed random variable, $Z \sim \mathcal{N}(0, 1)$, the probability density function will be denoted as the standard normal probability density function, $\varphi(z)$, and the cumulative distribution function as the standard normal cumulative distribution function, $\Phi(z)$. By definition, the following properties hold true:
 Nonstandard probability density function: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $f_X(x) = \frac{1}{\sigma} \varphi\left( \frac{x - \mu}{\sigma} \right)$
 Nonstandard cumulative distribution function: If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right)$
 Complement: $\Phi(-z) = 1 - \Phi(z)$
Definition (10) : [PaRe96] Given $a, b \in \mathbb{R}$, the following integrals hold true:
 (i) $\int_{-\infty}^{\infty} \varphi(x) \Phi(a + b x)\,dx = \Phi\left( \frac{a}{\sqrt{1 + b^2}} \right)$
 (ii) $\int_{-\infty}^{\infty} x \varphi(x) \Phi(a + b x)\,dx = \frac{b}{\sqrt{1 + b^2}} \varphi\left( \frac{a}{\sqrt{1 + b^2}} \right)$
Theorem (3) : The expected value of the maximum of the real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, is $\mathbb{E}[\max(X, Y)] = \mu_X \Phi(\alpha) + \mu_Y \Phi(-\alpha) + \theta \varphi(\alpha)$, where $\theta = \sqrt{\sigma_X^2 + \sigma_Y^2}$ and $\alpha = \frac{\mu_X - \mu_Y}{\theta}$.
Lemma (3) : Given real-valued continuous normally distributed random variables $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, then $\int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx = \mu_X \Phi(\alpha) + \frac{\sigma_X^2}{\theta} \varphi(\alpha)$.
Proof of Lemma (3) :
$\int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx = \int_{-\infty}^{\infty} x \frac{1}{\sigma_X} \varphi\left( \frac{x - \mu_X}{\sigma_X} \right) \Phi\left( \frac{x - \mu_Y}{\sigma_Y} \right) dx$ (Definition (9.i), Definition (9.ii))
$= \int_{-\infty}^{\infty} \left( \mu_X + \sigma_X u \right) \varphi(u) \Phi\left( a + b u \right) du$ (U-substitution with $u = \frac{x - \mu_X}{\sigma_X}$, $a = \frac{\mu_X - \mu_Y}{\sigma_Y}$, $b = \frac{\sigma_X}{\sigma_Y}$)
$= \mu_X \int_{-\infty}^{\infty} \varphi(u) \Phi(a + b u)\,du + \sigma_X \int_{-\infty}^{\infty} u \varphi(u) \Phi(a + b u)\,du$ (Integral linearity)
$= \mu_X \Phi\left( \frac{a}{\sqrt{1 + b^2}} \right) + \frac{\sigma_X b}{\sqrt{1 + b^2}} \varphi\left( \frac{a}{\sqrt{1 + b^2}} \right) = \mu_X \Phi(\alpha) + \frac{\sigma_X^2}{\theta} \varphi(\alpha)$ (Definition (10.i), Definition (10.ii))
Proof of Theorem (3) :
$\mathbb{E}[\max(X, Y)] = \int_{-\infty}^{\infty} x f_X(x) F_Y(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y) F_X(y)\,dy$ (Lemma (1))
$= \mu_X \Phi(\alpha) + \frac{\sigma_X^2}{\theta} \varphi(\alpha) + \mu_Y \Phi(-\alpha) + \frac{\sigma_Y^2}{\theta} \varphi(-\alpha)$ (Lemma (3))
$= \mu_X \Phi(\alpha) + \mu_Y \Phi(-\alpha) + \frac{\sigma_X^2 + \sigma_Y^2}{\theta} \varphi(\alpha)$ ($\varphi$ is even)
$= \mu_X \Phi(\alpha) + \mu_Y \Phi(-\alpha) + \theta \varphi(\alpha)$ (Definition (9.iii))
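Theorem (3) also admits a quick Monte Carlo check (an illustrative check; the function names and parameter values are mine):

```python
from math import erf, exp, pi, sqrt
import numpy as np

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi(z):
    """Standard normal density."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def expected_max_normal(mx, sx, my, sy):
    """Closed form from Theorem (3) for independent normals."""
    theta = sqrt(sx ** 2 + sy ** 2)
    alpha = (mx - my) / theta
    return mx * Phi(alpha) + my * Phi(-alpha) + theta * phi(alpha)

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(1.0, 2.0, n)
y = rng.normal(0.0, 1.0, n)
mc = np.maximum(x, y).mean()   # empirical E[max(X, Y)]
```

The empirical mean and the closed form agree to within sampling error.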
Continuous Uniform Distribution
Definition (11) : Given a real-valued continuous uniformly distributed random variable, $X$, with inclusive boundaries $a, b$ such that $a < b$, the probability density function is $f_X(x) = \frac{1}{b - a}$ for all $x \in [a, b]$ and zero everywhere else.
Corollary (11.i) : The cumulative distribution function of a real-valued continuous uniformly distributed random variable, $X$, is therefore $F_X(x) = \frac{x - a}{b - a}$ for $x \in [a, b]$, zero for $x < a$, and one for $x > b$.
Proof of Corollary (11.i) :
$F_X(x) = \int_{a}^{x} \frac{1}{b - a}\,dt = \frac{x - a}{b - a}$ for $x \in [a, b]$.
Corollary (11.ii) : The expected value of a real-valued continuous uniformly distributed random variable, $X$, is therefore $\mathbb{E}[X] = \frac{a + b}{2}$.
Proof of Corollary (11.ii) : $\mathbb{E}[X] = \int_{a}^{b} \frac{x}{b - a}\,dx = \frac{b^2 - a^2}{2 (b - a)} = \frac{a + b}{2}$.
Theorem (4) : The expected value of the maximum of real-valued continuous uniformly distributed random variables $X, Y$ is given piecewise by the cases below, according to the relative position of the two intervals.
Proof of Theorem (4) :
(Lemma (1))
Case (1) :
Case (2) :
Case (3) :
Case (4) :
Case (5) :
Case (6) :
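The general case analysis above is piecewise, but the special case of identically distributed variables gives a quick check: for $X, Y$ i.i.d. $U(a, b)$, the standard order-statistic result is $\mathbb{E}[\max(X, Y)] = a + \frac{2}{3}(b - a)$. The sketch below (function name and interval are mine) verifies this by Monte Carlo simulation:

```python
import numpy as np

def expected_max_uniform_iid(a, b):
    """E[max(X, Y)] for X, Y i.i.d. U(a, b): a + 2(b - a)/3."""
    return a + 2 * (b - a) / 3

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(1.0, 4.0, n)
y = rng.uniform(1.0, 4.0, n)
mc = np.maximum(x, y).mean()   # empirical E[max(X, Y)]
```

For $U(1, 4)$ the closed form gives 3, and the empirical mean matches to within sampling error.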
Summary Table
The following summary table lists the expected value of the maximum of real-valued continuous random variables for the exponential distribution, normal distribution, and continuous uniform distribution. The corresponding minimum can be obtained by Theorem (1).
Random Variables | Maximum
$X \sim \mathrm{Exp}(\lambda_X)$, $Y \sim \mathrm{Exp}(\lambda_Y)$ | $\frac{1}{\lambda_X} + \frac{1}{\lambda_Y} - \frac{1}{\lambda_X + \lambda_Y}$
$X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ | $\mu_X \Phi(\alpha) + \mu_Y \Phi(-\alpha) + \theta \varphi(\alpha)$
$X \sim U(a_X, b_X)$, $Y \sim U(a_Y, b_Y)$ | See Theorem (4)
References
[GrSt01] Grimmett, Geoffrey, and David Stirzaker. Probability and Random Processes. Oxford: Oxford UP, 2001. Print.
[PaRe96] Patel, Jagdish K., and Campbell B. Read. Handbook of the Normal Distribution. 2nd ed. New York: Marcel Dekker, 1996. Print.
Minesweeper Agent
Introduction
Lately I’ve been brushing up on probability, statistics and machine learning and thought I’d play around with writing a Minesweeper agent based solely on these fields. The following is an overview of the game’s mechanics, verification of an implementation, some different approaches to writing the agent and some thoughts on the efficacy of each approach.
Minesweeper
Background
Minesweeper was created by Curt Johnson in the late eighties and later ported to Windows by Robert Donner while at Microsoft. With the release of Windows 3.1 in 1992, the game became a staple of the operating system and has since found its way onto multiple platforms and spawned several variants. The game has been shown to be NP-complete, but in practice, algorithms can be developed to solve a board in a reasonable amount of time for the most common board sizes.
Specification
Gameplay 

An agent, , is presented a grid containing uniformly distributed mines. The agent’s objective is to expose all the empty grid locations and none of the mines. Information about the mines’ grid locations is gained by exposing empty grid locations which will indicate how many mines exist within a unit (Chebyshev) distance of the grid location. If the exposed grid location is a mine, then the player loses the game. Otherwise, once all empty locations are exposed, the player wins.  
Initialization 

The board consists of hidden and visible states. To represent the hidden, , and visible state, , of the board, two character matrices of dimension are used.
Characters ‘0’–‘8’ represent the number of neighboring mines, character ‘U’ represents an unexposed grid location, and character ‘*’ a mine. The neighbors of a grid location are the set of grid locations within a unit (Chebyshev) distance of it. 

Exposing Cells 

The expose behavior can be thought of as a flood fill on the grid, exposing any empty region bordered by grid locations containing mine counts and the boundaries of the grid.
A matrix, , represents the topography of the board. A value of zero is reserved for sections of the board that have yet to be visited, a value of one for those that have, two for those that are boundaries and three for mines. A stack, , keeps track of locations that should be inspected. If a cell location can be exposed, then each of its neighbors will be added to the stack to be inspected. Those neighbors that have already been inspected will be skipped. Once all the reachable grid locations have been inspected, the process terminates. 
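The stack-based expose behavior described above can be sketched as follows (a sketch of the flood fill, not the game's actual source; names are mine):

```python
def expose(hidden, visible, r0, c0):
    """Iteratively expose (r0, c0); empty '0' cells flood to their Chebyshev
    neighbors, stopping at numbered boundary cells and never touching mines
    that are not directly clicked."""
    n, m = len(hidden), len(hidden[0])
    stack = [(r0, c0)]           # locations still to inspect
    seen = set()                 # locations already inspected
    while stack:
        r, c = stack.pop()
        if (r, c) in seen or not (0 <= r < n and 0 <= c < m):
            continue             # skip revisits and off-board locations
        seen.add((r, c))
        visible[r][c] = hidden[r][c]
        if hidden[r][c] == '0':  # empty interior: push all eight neighbors
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr or dc:
                        stack.append((r + dr, c + dc))
    return visible
```

Clicking an empty region exposes the region plus its numbered border, while an unreachable mine stays hidden.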
Verification
Methodology
Statistical tests are used to verify the random aspects of the game’s implementation. I will skip the verification of the game’s logic as it requires use of a number of different methods that are better suited for their own post.
There are two random aspects worth thinking about: the distribution of mines and the distribution of success (i.e., not clicking a mine) for random trials. In both scenarios it made sense to conduct Pearson’s chi-squared test. Under this approach there are two hypotheses:
 $H_0$ : The distribution of the experimental data follows the theoretical distribution
 $H_1$ : The distribution of the experimental data does not follow the theoretical distribution
$H_0$ is accepted when the test statistic, $\chi^2$, is less than the critical value, $\chi^2_{\alpha, df}$. The critical value is determined by deciding on a p-value (e.g., 0.05, 0.01, 0.001), $\alpha$, that results in the tail area beneath the chi-squared distribution equal to $\alpha$, where $df$ is the degrees of freedom in the observation.
Mine distribution
The first aspect to verify was that mines were being uniformly placed on the board. For a standard board with mines, the expectation is that each grid location should be assigned times for trials. for this experiment.
In the above experiment, $\chi^2 < \chi^2_{\alpha, df}$, which affirms $H_0$ and that the implemented distribution of mines is indeed uniform at the chosen statistical significance level.
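The experiment can be replicated in a few lines (an illustrative sketch, assuming the standard 9x9 beginner board with 10 mines; the function name, trial count, and the tabulated critical value of roughly 101.9 for 80 degrees of freedom at $\alpha = 0.05$ are my own inputs):

```python
import numpy as np

def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

rng = np.random.default_rng(0)
cells, mines, trials = 81, 10, 10_000
counts = np.zeros(cells)
for _ in range(trials):
    # Uniform placement without replacement, as the game specifies
    counts[rng.choice(cells, size=mines, replace=False)] += 1
expected = np.full(cells, trials * mines / cells)
stat = chi_squared_statistic(counts, expected)
```

With a genuinely uniform placement, `stat` should fall comfortably below the critical value; a biased generator would inflate it.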
Distribution of successful clicks
The second aspect to verify is that the number of random clicks before exposing a mine follows a hypergeometric distribution. The hypergeometric distribution is appropriate since we are sampling (exposing) without replacement (the grid location remains exposed after clicking). This hypothesis relies on a non-flood-fill exposure.
The distribution has four parameters. The first is the number of samples drawn (number of exposures), the second the number of successes in the sample (number of empty exposures), the third the number of successes in the population (empty grid locations) and the last the size of the population (grid locations): .
The expected frequencies for the hypergeometric distribution is given by for trials. in this case.
In the above experiment, $\chi^2 < \chi^2_{\alpha, df}$, which affirms $H_0$ and that the number of locations exposed prior to exposing a mine follows a hypergeometric distribution at the chosen statistical significance level.
Also included in the plot is the observed distribution for a flood based exposure. As one might expect, the observed frequency of more exposures decreases more rapidly than that of the nonflood based exposure.
Agents
Methodology
Much like how a human player would learn to play the game, I decided that each model would have knowledge of the game’s mechanics and no prior experience with the game. An alternative class of agents would have prior experience with the game, as would be the case for a human player who had studied other players’ strategies.
To evaluate the effectiveness of the models, each played against a series of randomly generated grids and their respective success rates were captured. Each game was played on a standard beginner’s grid containing between mines.
For those models that refer to a probability measure, , it is assumed that the measure is determined empirically and treated as an estimate of the probability of an event and not as an a priori measure.
Marginal Model
Development: The first model to consider is the Marginal Model. It is designed to simulate the behavior of a naive player who believes that if he observes a mine at a grid location, that location should be avoided in future trials. The model treats the visible board as a matrix of discrete random variables where each grid location is interpreted as either a mine or not a mine. This model picks the grid location with the greatest empirical probability of being empty:

Test Results
Since the mine distribution is uniform, the model should be equivalent to selecting locations at random. The expected result is that avoiding previously occupied grid locations is an ineffective strategy as the number of mines increases. This does however, provide an indication of what the success rate should look like for chance alone.
Conditional Model
Development: One improvement over the Marginal Model is to take into account the visual clues made visible when an empty grid location is exposed. Since an exposed grid location indicates the number of neighboring mines, the Conditional Model can look at these clues to determine whether or not an unexposed grid location contains a mine. This boils down to determining the probability of a mine given the neighboring clues. A simplification in calculating the probability is to assume that each piece of evidence is independent. Under this assumption the result is a Naïve Bayes Classifier. As in the case of the Marginal Model, the Conditional Model returns the grid location that it has determined has the greatest probability of being empty given its neighbors:

Test Results
The Naïve Bayes Classifier is regarded as an effective approach to classification for a number of different tasks. In this case, however, it doesn’t look like it is effective at distinguishing mines from non-mines. The results are only slightly better than the Marginal Model’s.
Graphical Model
Development: One shortfall of the Conditional Model is that it takes a greedy approach in determining which action to take. A more sophisticated approach is to consider not just the next action, but the possible sequence of actions that will minimize the possibility of exposing a mine. Each of the possible observable grids can be thought of as a vertex in a graph whose edges represent the transition from one observable state to the next. Each transition is achieved by performing an action chosen from a subset of permitted actions given the state, and each transition has some probability of taking place. It is possible to pick a path through this graph that minimizes the risk by assigning a reward to each state and attempting to identify an optimal path from the present state that yields the greatest aggregate reward. Solving for this path is equivalent to solving the Longest Path Problem and can be computed efficiently using a dynamic programming solution.

From the optimal walk, a sequence of optimal actions is determined by mapping over the path. Taking the first action gives the optimal grid location to expose given the current visible state of the board.
This description constitutes a Markov Decision Process. As is the case for most stochastic processes, it is assumed that the process holds the Markov Property: that future states depend only upon the current state and none of the prior states. In addition to being a Markov Decision Process, this is also an example of Reinforcement Learning.
The first thing to observe is that the game state space is astronomical. For a standard beginner’s grid there are at most a sesvigintillion possible grids that a player can encounter, which, as an aside, is on the order of the number of atoms in the observable universe! The set of actions at each state is slightly more manageable, with at most eighty-one actions.
To simplify the state space, I chose to only consider small subgrids: when evaluating a full grid, the model considers the possible subgrids, evaluates the optimal sequence of actions for each, and takes the action associated with the maximum reward among the evaluated subgrids as the action on the full grid.
Test Results
The Graphical Model produces results that are only a margin better than those of the Conditional Model.
Semideterministic Model
Development: The last model I’m going to talk about is a semi-deterministic model. It works by using the visible grid to infer the topology of the hidden grid and, from the hidden grid, the topology that the visible grid can become. The grid can be viewed as a graph: each grid location is a vertex and an edge is an unexposed grid location’s influence on another grid location’s neighboring mine count. For each of the exposed grid locations on the board, its neighbors are all mines when the number of inbound edges matches the visible mine count. The model produces its inferred version of the influence graph by using the determined mine locations. For each of the grid locations that are exposed where the inferred influence matches the visible count, each of the neighbors about that location can be exposed, provided they are not already exposed and not an inferred mine. From this set of possibilities, a location is chosen to expose. When no locations can be determined by inference, an alternative model can be used. 
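One pass of the two deterministic rules just described can be sketched as follows (an illustrative sketch consistent with the description, not the agent's actual source; names are mine):

```python
def infer(visible):
    """Apply two rules to a visible board of characters:
    (1) if an exposed count equals its number of unexposed neighbors,
        those neighbors are all inferred mines;
    (2) if an exposed count equals its number of inferred neighboring mines,
        its remaining unexposed neighbors are safe to expose."""
    n, m = len(visible), len(visible[0])

    def neighbors(r, c):
        return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < n and 0 <= c + dc < m]

    mines, safe = set(), set()
    for r in range(n):                      # rule (1): saturate mine inferences
        for c in range(m):
            if visible[r][c] in '012345678':
                unexposed = [p for p in neighbors(r, c)
                             if visible[p[0]][p[1]] == 'U']
                if len(unexposed) == int(visible[r][c]):
                    mines.update(unexposed)
    for r in range(n):                      # rule (2): collect safe exposures
        for c in range(m):
            if visible[r][c] in '012345678':
                unexposed = [p for p in neighbors(r, c)
                             if visible[p[0]][p[1]] == 'U']
                flagged = [p for p in unexposed if p in mines]
                if len(flagged) == int(visible[r][c]):
                    safe.update(p for p in unexposed if p not in mines)
    return mines, safe
```

On a board where a ‘1’ touches exactly one unexposed cell, rule (1) pins the mine and rule (2) then clears the neighbors of the other counts.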
Test Results
Since the model is a more direct attempt at solving the board, its results are superior to those of the previously presented models. As the number of mines increases, it is more likely to have to fall back on a more probabilistic approach.
Summary
Each of the models evaluated offered incremental improvements over its predecessors. Randomly selecting locations to expose is on par with choosing a location based on previously observed mine locations. The Conditional Model and Graphical Model yield similar results since they both make decisions based on conditioned probabilities. The Semi-deterministic Model stands alone as the only model that produced reliable results.
The success rate improvement between the Conditional and Marginal models is most notable for boards consisting of three mines, and the improvement between the Graphical and Semi-deterministic models for seven mines. Improvements between the Random and Marginal models are negligible, and between the Conditional and Graphical models minor, for all mine counts fewer than seven.
Given the mathematical complexity and nondeterministic nature of the machine learning approaches (in addition to the complexity and time involved in implementing those approaches), they don’t seem justified when more deterministic and simpler approaches exist. In particular, it seems like most people have implemented their agents using heuristics and algorithms designed to solve constraint satisfaction problems. Nonetheless, this was a good refresher on some of the elementary aspects of probability, statistics, and machine learning.
References
“Classification – NaĂŻve Bayes.” Data Mining Algorithms in R. Wikibooks. 3 Nov. 2010. Web. 30 Oct. 2011.
“Windows Minesweeper.” MinesweeperWiki. 8 Sept. 2011. Web. 30 Oct. 2011.
Kaye, Richard. “Minesweeper Is NP-complete.” [pdf] Mathematical Intelligencer 22.2 (2000): 9–15. Web. 30 Oct. 2011.
Nakov, Preslav, and Zile Wei. “MINESWEEPER, #MINESWEEPER.” 14 May 2003. Web. 14 Apr. 2012.
Sutton, Richard S., and Andrew G. Barto. “3.6 Markov Decision Processes.” Reinforcement Learning: An Introduction. Cambridge, Massachusetts: Bradford Book, 1998. 4 Jan. 2005. Web. 30 Oct. 2011.
Rish, Irina. “An Empirical Study of the Naive Bayes Classifier.” [pdf] IJCAI-01 Workshop on Empirical Methods in AI (2001). Web. 30 Oct. 2011.
Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall/Pearson Education, 2003. Print.
Sun, Yijun, and Jian Li. “Adaptive Learning Approach to Landmine Detection.” [pdf] IEEE Transactions on Aerospace and Electronic Systems 41.3 (2005). 10 Jan. 2006. Web. 30 Oct. 2011.
Taylor, John R. An introduction to error analysis: the study of uncertainties in physical measurements. Sausalito, CA: University Science Books, 1997. Print.