Rio Yokota has implemented exaFMM, a Fast Multipole Method (FMM) library that accelerates applications which must quickly calculate the effects of long-range forces, such as gravity or magnetism, on discrete particles in a simulation. Based on work he performed as a postdoc with Lorena Barba, the open-source FMM library runs on GPUs, multicore processors, and Intel Xeon Phi, plus most of the leadership-class supercomputers around the world. The P2P (particle-to-particle) kernel of the exaFMM library can exceed a teraflop/s (TF/s) of performance on a single Intel Xeon Phi (Knights Corner) coprocessor or NVIDIA Kepler GPU. To my knowledge, this is the second user-written (i.e. non-MKL library) code released to users that achieves this level of performance on Intel Xeon Phi. (The first is the Farber machine learning code described here.)

N-body simulation is one of the “seven dwarfs of HPC”, the common patterns the HPC community has identified as the most important algorithm classes in numerical simulation. Each dwarf captures a particular pattern of computation and communication shared by widely used HPC scientific codes:

- Dense linear algebra
- Sparse linear algebra
- Spectral methods
- N-body methods
- Structured grids
- Unstructured grids
- Monte Carlo methods

Rio quantifies exaFMM performance in terms of the roofline model, which bounds the attainable performance of a kernel as a function of its arithmetic intensity, the machine’s peak floating-point rate, and the machine’s peak memory bandwidth. As you can see in the plot in the slide below, exaFMM achieves very high performance on both Intel Xeon Phi and NVIDIA Kepler accelerators. (A pdf of the slides can be found at this link.)
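For readers who want the roofline bound in executable form, the following is a minimal sketch; the function name and units are illustrative, not part of exaFMM:

```c
/* Roofline model: attainable GFLOP/s is the lesser of the machine's peak
 * compute rate and the rate at which memory bandwidth can feed the kernel,
 * i.e. min(peak, bandwidth * arithmetic_intensity). */
double roofline_gflops(double peak_gflops,      /* machine peak compute rate */
                       double mem_bw_gb_per_s,  /* peak memory bandwidth     */
                       double flops_per_byte)   /* kernel arithmetic intensity */
{
    double memory_bound = mem_bw_gb_per_s * flops_per_byte;
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}
```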

The Roofline Model and Seven Dwarfs of HPC

You can see a large particle-count N-body system in operation in this video clip from the NVIDIA GTC conference: http://youtu.be/aByz-mxOXJM?t=2m40s

**Understanding computational intensity**

Most computationally oriented scientists and programmers are familiar with the BLAS (Basic Linear Algebra Subprograms) library. BLAS is the *de facto* programming interface for basic linear algebra, and it is structured in three levels with increasing data and runtime requirements:

- Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.
- Level-2: Matrix-vector operations that require O(N^{2}) data and O(N^{2}) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.
- Level-3: Matrix-matrix operations that require O(N^{2}) data and O(N^{3}) work. Examples include dense matrix-matrix multiplication.
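To make the three levels concrete, here is a short sketch using the standard CBLAS interface; the CBLAS calls are real, while the demo wrapper and its arguments are illustrative:

```c
#include <cblas.h>   /* C interface to BLAS; link against e.g. OpenBLAS or MKL */

/* Hypothetical demo wrapper: one call at each BLAS level.
 * x, y are length-n vectors; A, B, C are n-by-n row-major matrices. */
void blas_levels_demo(int n, double *x, double *y,
                      double *A, double *B, double *C)
{
    /* Level-1: inner product, O(N) data and O(N) work */
    double dot = cblas_ddot(n, x, 1, y, 1);
    (void)dot;

    /* Level-2: y = A*x + y, O(N^2) data and O(N^2) work */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 1.0, y, 1);

    /* Level-3: C = A*B + C, O(N^2) data and O(N^3) work */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 1.0, C, n);
}
```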

The table below summarizes, for each BLAS level, the amount of data that must be transferred from the host to the device and the amount of work performed on that data. It does not take into account the time required to transfer results back to the host.

| BLAS level | Data | Work | Work per datum |
|------------|------|------|----------------|
| 1 | O(N) | O(N) | O(1) |
| 2 | O(N^{2}) | O(N^{2}) | O(1) |
| 3 | O(N^{2}) | O(N^{3}) | O(N) |

Table: Work per datum by BLAS level

The “work per datum” column tells us that Level-3 BLAS operations should run efficiently on GPUs and coprocessors because they perform O(N) work for every floating-point value transferred to the device; the device spends its time calculating rather than waiting for data. The same work-per-datum analysis applies to non-BLAS computational problems such as N-body simulations. For example, the following slide shows that exaFMM spends most of its time computing (the red bars) on the Intel Xeon Phi coprocessors in the TACC Stampede supercomputer. In other words, the time spent computing dominates the runtime even when using many Intel Xeon Phi devices (up to 256 in this slide). Weak scaling describes how the solution time varies as the number of processing elements grows while the problem size (e.g. workload) assigned to each processing element remains constant, so the additional elements solve a larger overall problem.

**Algorithmic speedups**

The code for an O(N^{2}) runtime N-body simulation is quite simple, as the two-loop example below shows. In contrast, the FMM runs in O(N) time, roughly N-times faster, which is quite significant in million- or billion-particle systems. The exaFMM library handles the complexity involved in correctly and efficiently implementing the Fast Multipole Method, which is a decidedly non-trivial algorithm. Even the most ardent “I’ll write my own code” advocate will be happy to use exaFMM after delving into the myriad of details required to implement a complete and computationally efficient FMM.
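Here is a minimal sketch of such a two-loop, direct-summation kernel, assuming a simple gravitational model with a softening term; it illustrates only the O(N^{2}) scaling and is not exaFMM’s actual P2P kernel:

```c
#include <math.h>

/* Direct-summation N-body: for each target particle, accumulate the
 * force contribution of every source particle. The two nested loops
 * over N bodies give the O(N^2) scaling. */
void nbody_direct(int n, const float *x, const float *y, const float *z,
                  const float *m, float *ax, float *ay, float *az,
                  float softening)
{
    for (int i = 0; i < n; i++) {            /* outer loop: targets */
        float axi = 0.f, ayi = 0.f, azi = 0.f;
        for (int j = 0; j < n; j++) {        /* inner loop: sources */
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            /* softening > 0 avoids the singularity when i == j */
            float r2 = dx*dx + dy*dy + dz*dz + softening*softening;
            float inv_r  = 1.0f / sqrtf(r2);
            float inv_r3 = inv_r * inv_r * inv_r;
            axi += m[j] * dx * inv_r3;  /* G folded into m[] for brevity */
            ayi += m[j] * dy * inv_r3;
            azi += m[j] * dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}
```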

More precisely, Lorena Barba at Boston University notes:

*Multipole expansions are used to represent the effect of a cluster of particles, and are built for parent cells by an easy transformation (M2M). Local expansions are used to evaluate the effect of multipole expansions on many target points locally. The multipole-to-local (M2L) transformation is used only on the FMM, while treecodes compute the effect of multipole expansions directly on target points, with O(N log N) scaling for N particles. The FMM has the ideal O(N) scaling.*
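For orientation, here is a commented outline of the kernels the quote refers to, using the standard FMM kernel names from the literature; the function skeleton is illustrative, not exaFMM’s API:

```c
/* One complete FMM evaluation, in the standard kernel vocabulary.
 * Only the structure is shown; the expansions themselves are omitted. */
void fmm_evaluate(void)
{
    /* P2M: build multipole expansions from the particles in each leaf cell */
    /* M2M: translate child multipoles into parent cells (upward pass)      */
    /* M2L: convert well-separated multipoles into local expansions         */
    /* L2L: translate parent local expansions down to child cells           */
    /* L2P: evaluate local expansions at the target particles               */
    /* P2P: direct summation for nearby cells (the TF/s kernel noted above) */
}
```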

While collaborating with Rio during my recent visit to KAUST, we found that using aligned vectors was key to getting much better performance than is reported in the 2013 numbers below. The change was simple: it required switching from **malloc()** to the **_mm_malloc()** aligned memory allocation call. Prior to my visit, performance of the P2P (particle-to-particle) kernel was around 800 GF/s; switching to **_mm_malloc()** bumped performance by about 200 GF/s. Rio now reports that additional work has resulted in Intel Xeon Phi performance exceeding 1 TF/s per device!

Following are some of the 2013 overall exaFMM performance numbers. The exaFMM library is available on a variety of leadership-class supercomputers around the world. However, if you need to calculate long-range forces between large numbers of particles on a small cluster, or even a single device, then download exaFMM and give it a try.
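The change looks roughly like the sketch below; the allocation helper is hypothetical, but **_mm_malloc()** and **_mm_free()** are the real Intel intrinsics (declared in xmmintrin.h), and 64-byte alignment matches the 512-bit vector registers and cache-line size of Intel Xeon Phi:

```c
#include <stdlib.h>      /* size_t, malloc() shown for comparison */
#include <xmmintrin.h>   /* declares _mm_malloc() and _mm_free()  */

/* Hypothetical allocation helper showing the one-line change. */
float *alloc_aligned_particles(size_t n)
{
    /* before: float *buf = (float *)malloc(n * sizeof(float)); */
    float *buf = (float *)_mm_malloc(n * sizeof(float), 64);
    return buf;   /* must be released with _mm_free(buf), not free() */
}
```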
