TechEnablement


N-body Methods on Intel Xeon Phi Coprocessors

October 16, 2014 by Rob Farber Leave a Comment

The chapter authors (Rio Yokota and Mustafa Abdul Jabbar) achieve roughly 1.5 TF/s of single-precision performance running an optimized direct N-body kernel on an Intel Xeon Phi coprocessor. This level of performance was achieved through OpenMP, SIMD directives, and _mm512 intrinsics. The authors report near-ideal strong scaling across cores, while noting that intra-core threading was less efficient.


N-body methods have an arithmetic intensity even higher than DGEMM, and can extract the full potential of architectures such as the Intel Xeon Phi coprocessor. The direct N-body kernel is the key component of the fast multipole method (FMM). FMM has recently attracted considerable attention because it offers a unique combination: optimal arithmetic complexity of O(N) together with high arithmetic intensity, and optimal communication complexity of O(log P) together with high asynchronicity. FMM can replace FFT or sparse matrix solvers in many scientific simulations, so it has a large range of potential applications. In short, it is a highly exascale-capable algorithm that can be swapped in for the dominant algorithms in many of today's scientific codes.

See the TechEnablement article, “ExaFMM: An Exascale-capable, TF/s per GPU or Xeon Phi, Long-Range Force Library for Particle Simulations,” for more information about the exascale-capable ExaFMM library.

Single-precision GF/s of the direct N-body kernel on an Intel Xeon Phi coprocessor without intrinsics, for different numbers of cores. (courtesy Morgan Kaufmann)

Single-precision GF/s of the direct N-body kernel on an Intel Xeon Phi coprocessor, with intrinsics and without intrinsics (vectorizing either the outer i loop or the inner j loop), for different problem sizes. (courtesy Morgan Kaufmann)

 

Chapter Authors

Rio Yokota

Rio Yokota is a Research Scientist in the Extreme Computing Research Center at KAUST. He is the main developer of the FMM library ExaFMM. He was part of the team that won the Gordon Bell prize for price/performance in 2009 using his FMM code on 760 GPUs. He is now optimizing his ExaFMM code on architectures such as Titan, Mira, Stampede, K computer, and TSUBAME 2.5.

Mustafa Abdul Jabbar

Mustafa Abdul Jabbar is a PhD student in the Extreme Computing Research Center at KAUST. He works on optimization of the FMM algorithm using low-level programming and data-driven runtime systems.

See the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.

