The chapter authors (Rio Yokota and Mustafa Abdul Jabbar) achieve roughly 1.5 TF/s single-precision performance with an optimized direct N-body kernel on an Intel Xeon Phi coprocessor. This level of performance was achieved through the use of OpenMP, SIMD directives, and _mm512 intrinsics. The authors report near-ideal strong scaling across cores, while noting that intra-core threading was less efficient.
N-body methods have an arithmetic intensity even higher than that of DGEMM, and can therefore extract the full potential of architectures such as the Intel Xeon Phi coprocessor. The direct N-body kernel is the key component of the fast multipole method (FMM). FMM has recently attracted a lot of attention because it offers a unique combination: optimal O(N) arithmetic complexity with high arithmetic intensity, and optimal O(log P) communication complexity with high asynchronicity. FMM can be used instead of FFTs or sparse matrix solvers in many scientific simulations, so it has a large number of potential applications. In short, it is a very exascalable algorithm that can be swapped in for the major algorithms of today in many scientific codes.
See the TechEnablement article, “ExaFMM: An Exascale-capable, TF/s per GPU or Xeon Phi, Long-Range Force Library for Particle Simulations” for more information about the exascale-capable ExaFMM library.
Rio Yokota is a Research Scientist in the Extreme Computing Research Center at KAUST. He is the main developer of the FMM library ExaFMM. He was part of the team that won the Gordon Bell prize for price/performance in 2009 using his FMM code on 760 GPUs. He is now optimizing his ExaFMM code on architectures such as Titan, Mira, Stampede, K computer, and TSUBAME 2.5.
Mustafa Abdul Jabbar is a PhD student in the Extreme Computing Research Center at KAUST. He works on optimization of the FMM algorithm using low-level programming and data-driven runtime systems.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.