Posts Tagged ‘Cyclic Reduction’
Using GPUs to solve Spatial FitzHughNagumo Equation Numerically
Introduction
Modeling Action Potentials
In neurophysiology, we wish to model how electrical impulses travel along the complex structures of neurons. In particular, along the axon since it is the principle outbound channel of the neuron. Along the axon are voltage gated ion channels in which concentrations of potassium and sodium are exchanged. During depolarization, a fast influx of sodium causes the voltage to gradually increase. During repolarization, a slow outflux of potassium causes the voltage to decrease gradually. Because the potassium gates are slow to close, there is a negative voltage dip and recovery phase afterwards called hyperpolarization.
Attempts to provide a continuous time model of these phases began with work in the 1950s by Hodgkin and Huxley [HH52]. The duo formulated a fourdimensional nonlinear system of ordinary differential equations based on an electrical circuit model of giant squid axons. This landmark contribution was later simplified in the 1960s by FitzHugh [Fit61]. Here FitzHugh casted the problem in terms of Van der Pol oscillators. This reformulation allowed him to explain the qualitative behavior of the system in terms of a twodimensional excitation/relaxation statespace model. The following year Nagumo [NAY62] extended these equations spatially to model the desired action potential propagation along an axon, and demonstrated its behavior on analog computers. This spatial model will be the focus of this work, and a more comprehensive account of these developments can be found in [Kee02].
FitzHughNagumo Equation
The Spatial FitzHughNagumo equation is a twodimensional nonlinear reactiondiffusion system:

(1) 
Here represents the action potential (voltage) along the axon, and the recovery of the system. Constant is absent from the literature, and is inversely proportional to recovery time.
Mathematical Analysis
The system does not admit a general analytic solution. Mathematical analysis of the system is concerned with the existence and stability of traveling waves with [AK15] providing a thorough account of these aspects. Other analyses are concerned with the statespace model and the relationship of parameters with observed physiological behavior. In particular: selfexcitatory, impulse trains, single traveling wavefronts and impulses, doubly traveling divergent wave impulses, and nonexcitatory behavior that returns the system to a resting state. Since the FitzHughNagumo equations are well understood, more recent literature is focused on higher dimensional and stochastic variants of the system which [Tuc13] discusses in detail.
Numerical Analysis
A survey of the literature revealed several numerical approaches to solving the FitzHughNagumo equations consisting of a Finite Element method with Backward Differentiation Formulae [Ott10], and the Method of Lines [Kee02] approach. In this work, three approaches based on the Finite Difference method are investigated: an explicit scheme, an adaptive explicit scheme, and an implicit scheme; along with their associated sequential CPU bound and parallel GPU bound algorithms.
Explicit Finite Difference Scheme
To study the basic properties of the FitzHughNagumo equations an explicit scheme was devised using forward and central differences for the temporal and spatial derivatives respectively.

(2) 
Truncation errors are linear in time, , and quadratic, , in space. For stability, , implying the use of inconveniently small time steps, however in practice it seems rare to find any mention of in the literature. Each time step can be computed sequentially or in parallel in linear time, . However, there is a significant constant factor improvement delivered by parallel GPU implementations since modern GPUs allow millions of nodes to be updated in parallel giving near constant runtime performance. Further parallelization is possible by trivially distributing contiguous node sets to a combination of multiple machines and multiple GPUs.
Experiments
The explicit scheme was used to investigate the traveling wave and divergent wave behaviors of the FitzHughNagumo equations. Fig. (1) demonstrate the application of a constant impulse, , at the end of an unexcited axon, , giving rise to oscillatory behavior. Fig. (2) shows a Gaussian impulse applied initially to the center of the axon, . As the system evolves, the impulse collapses giving rise to two separate impulses that travel in opposite directions before dissipating at the boundaries returning the axon to a completely unexcited state. Both test cases are qualitatively consistent with the literature and share the following parameters:
(3) 
Error Analysis
Figure 3: Varying values of w.r.t at . fixed to 0.0098. 
Figure 4: Varying values of w.r.t at different values of . fixed to 0.5. 
Two experiments were ran to verify the suggested truncation errors. The numerical solution given by a sufficiently small step size serves as an analytic solution to , which is then used to evaluate how larger steps sizes deviate as measured by the root mean squared error. Fig. (3) looks at varying and shows that as the step size as halved, the resulting RMSE is quartered consistent with the expect quadratic truncation term of the scheme. Similarly, Fig. (4) looks at varying and shows that independent of , halving the step size results in an equally reduced RMSE consistent with the expected linear truncation term of the scheme.
Runtime Performance
Figure 5: Wall time for memory allocation, memory transfers, and core loop for both CPU and GPU. 
Figure 6: Wall time for just core loop for both CPU and GPU. 
Comparison of sequential CPU and parallel GPU bound algorithms is based on the wall time taken to perform a single iteration to calculate and . Reported figures are a function of spatial node quantity and the mean run time of 30 iterations. The main bottleneck to performance is transferring buffers between the host and device. Fig. (5) illustrates this effect. For the largest test case considered, , “GPU w/o tx” delivers 19x faster runtime over “GPU w/ tx” by not transferring intermediate results back to the host. Similarly, it delivers significantly faster runtime over the CPU by a factor of 73x. To better understand this performance boost, Fig. (6) looks at just the core loop execution time. For , the core loops account for only 3.7% and 14.9% of the execution time on the CPU and GPU respectively with the GPU delivering 18x better performance than the CPU. These percentages increase monotonically, and suggest that memory transfers will eventually be dominated by sufficiently large inputs on GPUs with an abundance of memory.
Adaptive Explicit Finite Difference Scheme
While investigating the traveling wavefront solution of the system, numerical oscillations were observed as the simulation zeroed in on the steady state solution. To address this issue, an adaptive explicit scheme was devised. A survey of the literature suggested a family of moving grid methods to solve the FitzHughNagumo system based on a Lagrangian formulation [VBSS90] and a method of lines formulation [Zwa11]. Here a heuristic is used to concentrate grid points where the most change takes place.
The formulation of the scheme is identical to the former section with second order spatial derivatives being approximated by Lagrange interpolating polynomials since the technique supports nonuniform grids. A three point, first order truncation error scheme is used. A five point, third order truncation error scheme was considered, but abandoned in favor of the empirically adequate three point scheme.
(4) 
The first and second spatial derivatives given by the Lagrange interpolating polynomials are used to decide how much detail is needed in the domain in addition to the first temporal derivative given by finite differences. First, a coarse grid is laid down across the entire domain to minimize the expected distance between nodes since the second order derivative has first order truncation error. Next, the magnitude of the first order spatial derivative (second order truncation error) is used to lay down a finer grid when the derivative is greater than a specified threshold. This corresponds to where the waves of the solution are.
Next, the temporal derivative is used to lay down an even finer grid in those areas having an absolute change above a specified threshold. The change in time corresponds to the dynamics of the system, by adding detail in these areas, we can preserve the behavior of the system. Finally, the zeros of the second spatial derivative serve as indicators of inflection points in the solution. These correspond most closely to the location of the traveling wavefronts of the equation. Here, the most detail is provided around a fixed radius of the inflection points since the width of the wavefronts does not depend on parameterization.
Each iteration, the explicit scheme is evaluated on the grid from the previous iteration and those results are then used to perform the grid building scheme. To map the available solution values to the new grid, the points are linearly interpolated if a new grid point falls between two previous points, or mapped directly if there is a stationary grid point between the two iterations. The latter will be the more common case since all grid points are chosen from an underlying uniform grid specified by the user. Linear interpolation will only take place when extra grid points are included in an area.
Experiments
Figure 7: Top: Numerical oscillation of a centered Gaussian impulse with and all other parameters the same as those given in previous section for . Bottom: Eliminated numerical oscillation based on the adaptive explicit scheme.
Fig. (7) is the motivating example for this scheme and demonstrates how numerical oscillations can be avoided by avoiding calculations in regions with stationary solutions. To demonstrate that the scheme works well for other test cases, the more interesting and dynamic divergent impulse test case is shown in Fig. (8). Here we can see that as time progresses, points are allocated to regions of the grid that are most influenced by the system’s dynamics without sacrificing quality. For the time steps shown, the adaptive scheme used 23x fewer nodes than the explicit scheme from the previous section.
Figure 8: Example of adaptive grid on the divergent impulse example for { 15.06, 30.13, 45.05, 60.12, 75.05, 90.11 }. Red lines given by the explicit scheme from the previous section, green dots given by adaptive explicit scheme.
Implicit Finite Difference Scheme
An implicit CrankNicolson scheme is used in this section to solve the FitzHughNagumo equations. To simplify the computations, is solved explicitly using a second order central difference scheme before solving leading to the following formulation:

(5) 
The truncation error for this scheme is with an improved, albeit still restrictive, requirement over the explicit scheme that . Based on this formulation, the righthand side is completely known when is calculated. This gives a simpler expression to consider:
(6) 
Newton’s method is used to solve the nonlinear function with an initial guess equal to the previous time step’s values, . To refine the estimate, the following system is solved iteratively until the magnitude of the refinement, , is less than a specified tolerance or machine epsilon.
(7) 
This formulation gives rise to a tridiagonal Jacobian matrix, , where the first and last row are specified by noflux boundary conditions, and constant offdiagonal entries and variable diagonal entries given by the partial derivatives of each nodal equation.

This tridiagonal system can be solved sequentially in time using the Thomas algorithm or in time using the Cyclic Reduction algorithm of [Hoc65]. Cyclic Reduction begins with a forward phase in which all odd indexed unknowns are eliminated recursively until a or system remains that can be solved directly. Cyclic Reduction ends with a backward phase that walks up the recursion stack to solve for the previously eliminated odd indexed unknowns. Since the algorithm decouples unknowns at each recursion level of the forward and backwards phases, these reductions can be done in parallel in time on the GPU assuming fold parallelism.
Further parallelism can be achieved by solving points explicitly along the domain, then using those results to create implicit subdomains that can be solved using either Thomas or Cyclic Reduction algorithms on a combination of multiple machines and multiple GPUs at the expense of additional communication and coordination.
Error Analysis
Figure 9: Varying values of w.r.t at . fixed to 0.000071. 
Figure 10: Varying values of w.r.t at different values of . fixed to 0.52. 
Evaluation of the spatial error revealed an unexpected linear behavior as shown in Fig. (9). As the spatial step is halved, the resulting error was expected to become quartered, instead it is halved. No clear explanation was discovered to account for this discrepancy. With respect to time, Fig. (10) shows that both the Thomas and Cyclic Reduction algorithms were quadratic as multiple points in time were evaluated. The Thomas algorithm produced aberrations as the step size increased eventually producing numerical instability, while the Cyclic Reduction algorithm was immune to this issue.
Figure 11: Stability of implicit solvers. 
Figure 12: Convergence of implicit solvers. 
In terms of stability, the implicit scheme is stable up to depending on which tridiagonal solver is used, and for comparison, the explicit scheme is stable up to precisely as shown in Fig. (11). The Thomas algorithm demonstrates a slightly weaker stability guarantee than the Cyclic Reduction algorithm becoming unstable around and respectively. In terms of convergence, the Thomas algorithm typically takes more iterations than Cyclic Reduction to obtain the same degree of accuracy as shown in Fig. (12). Taking the sequence of adjustments up to machine epsilon , Thomas algorithm gives suggesting qlinear convergence. Likewise, Cyclic Reduction gives suggesting qquadratic convergence.
Runtime Performance
Figure 13: Performance comparison of CPU and GPU bound Thomas and Cyclic Reduction algorithms. 
Figure 14: Performance comparison of Jacobian solvers. 
Sequential Thomas and Cyclic Reduction routines perform equally well on CPU as shown in Fig. (13). The parallel Cyclic Reduction method did not demonstrate significant performance gains on the GPU. However, looking at just the time taken to solve the Jacobian each iteration (not including initialization or memory transfers), parallel Cyclic Reduction on the GPU was 45x faster than both sequential CPU solvers as shown in Fig. (14).
To explain the poor performance of Cyclic Reduction on the GPU, there are a number of different factors at play. The algorithm is susceptible to warp divergence due to the large number of conditionals that take place. Reliance on global memory access with varying strides contributes to slow performance since shared memory can’t be effectively utilized, and each iteration the adjustments are transferred from device to host to decide if Newton’s method should terminate. These different factors suggest alternative GPU implementations need to be investigated to address these different issues.
Discussion
Work Environment
All results in this paper were based on code written in C#, and compiled using Microsoft Visual Studio Express with Release settings to run on a commodity class Intel Core i72360QM quad core processor. Open source CUDAfy.NET is used to run C# to CUDA translated code on a commodity class NVIDIA GeForce GT 525m having two streaming multiprocessors providing 96 CUDA cores.
Future Work
Numerically, additional work could be done on the adaptive explicit scheme. In preparing the scheme, a cubic splinebased approach was abandoned in favor of simpler approaches due to time pressures. It would be worthwhile to explore how to solve the system on a splinebased “grid”.
In addition, further work could be done to optimize the implementation of the parallel Cyclic Reduction algorithm on the GPU since it delivered disappointing runtime behavior compared to the sequential algorithm on the CPU. [CmWH14] mention several different optimizations to try, and I believe better global memory access will improve runtime at the expense of more complicated addressing. As an alternative to Cyclic Reduction, both [CmWH14] and [ZCO10] detail several different parallel tridiagonal solvers that could be deployed.
In terms of models considered, there are a number of different directions that could be pursued including higher dimensional [MC04], coupled [Cat14]], [Ril06], and stochastic variants [Tuc13] of the spatial FitzHughNagumo equation. Coming from a probabilistic background, I would be interested in investing time in learning how to solve stochastic ordinary and partial differential equations.
Conclusion
Three finite difference schemes were evaluated. An explicit scheme shows great performance on both the CPU and GPU, but it is susceptible to numerical oscillations. To address this issue, an adaptive explicit scheme based on heuristics was devised and is able to eliminate these issues while requiring fewer nodes to produce results onpar with the explicit scheme. An implicit scheme was evaluated which demonstrated a principled, and robust solution for a variety of test cases and is the favored approach of the three evaluated.
References
[AK15] Gianni Arioli and Hans Koch. Existence and stability of traveling pulse solutions of the fitzhughnagumo equation. Nonlinear Analysis: Theory, Methods and Applications, 113:5170, 2015
[Cat14] Anna Cattani. Fitzhughnagumo equations with generalized diffusive coupling. Mathematical Biosciences and Engineering, 11(2):203215, April 2014
[CmWH14] LiWen Chang and Wen mei W. Hwu. A guide for implementing tridiagonal solvers on gpus. In Numerical Computations with GPUs, pages 2944. Springer International Publishing, 2014.
[Fit61] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445, 1961.
[HH52] Alan L. Hodgkin and Andrew F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500544, 1952.
[Hoc65] R. W. Hockney. A fast direct solution of poisson’s equation using fourier analysis. J. ACM, 12(1):95113, Jan 1965.
[Kee02] James P. Keener. Spatial modeling. In Computational Cell Biology,volume 20 of Interdisciplinary Applied Mathematics, pages 171197. Springer New York, 2002.
[MC04] Maria Murillo and XiaoChuan Cai. A fully implicit parallel algorithm for simulating the nonlinear electrical activity of the heart. Numerical linear algebra with applications, 11(23):261277, 2004.
[NAY62] J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):20612070, Oct 1962.
[Ott10] Denny Otten. Mathematical models of reaction diffusion systems, their numerical solutions and the freezing method with comsol multiphysics. 2010.
[Ril06] Caroline Jane Riley. Reaction and diffusion on the sierpinkski gasket. PhD thesis, University of Manchester, 2006.
[Tuc13] Henry C. Tuckwell. Stochastic partial differential equations in neurobiology. Linear and nonlinear models for spiking neurons. In Stochastic Biomathematical Models, volume 2058 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2013.
[VBSS89] J. G. Verwer, J. G. Blom, and J. M. SanzSerna. An adaptive moving grid method for one dimensional systems of partial differential equations. Journal of Computational Physics, 82(2):454486, 1989.
[ZCO10] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the gpu. SIGPLAN Not., 45(5):127136, Jan 2010.
[Zwa11] M. N. Zwarts. A test set for an adaptive moving grid pde solver with timedependent adaptivity. Master’s thesis. Utrecht University, Utrecht, Netherlands, 2011.