
TechEnablement

Education, Planning, Analysis, Code


Heterogeneous Computing with MPI On Intel Xeon Phi

October 21, 2014 by Rob Farber Leave a Comment

The chapter authors discuss the hardware heterogeneity found in modern clusters and then analyze a typical Intel Xeon Phi coprocessor-accelerated node on the Stampede cluster at TACC, with an eye toward how MPI is used in similar clusters and how MPI tasks are positioned within a node. The performance of the different communication pathways is highlighted using microbenchmarks. Finally, a hybrid Lattice Boltzmann Method application is configured with optimal MPI options, and scaling performance is evaluated both with and without proxy-based communications.

In the early days of MPI, compute nodes were simple and there was no concern about positioning MPI tasks. The nodes in modern HPC systems have multiple processors, coprocessors, and graphics accelerators. Each coprocessor and graphics device has its own memory, while the processors share memory. On multi-core systems, launching an MPI task on every core is common practice. In systems with a coprocessor, it is more likely that hybrid MPI programs will drive several MPI tasks on the host and a different number on the coprocessor, with the OpenMP parallel regions employing different thread counts on the host and the device. So, with hybrid programs running across heterogeneous architectures, asymmetric interfaces, and NUMA architectures, it is understandable that positioning tasks and threads is more important on modern HPC systems.
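As a rough illustration of such a heterogeneous launch, the sketch below shows how a hybrid MPI+OpenMP job might be started on a Stampede-style node with Intel MPI, placing two tasks on the host and one on the coprocessor with different thread counts. The hostnames, binary names, and thread counts are illustrative, not taken from the chapter; the environment variables (`I_MPI_MIC`, `I_MPI_PIN_DOMAIN`) and the `mpirun` colon syntax are standard Intel MPI features, but exact values depend on the system and MPI version.

```shell
# Hypothetical hybrid launch on one host + one Xeon Phi coprocessor.
# Hostnames (c401-101, c401-101-mic0) and binaries (lbm.host, lbm.mic)
# are placeholders for illustration only.

# Enable MPI task placement on the coprocessor (Intel MPI)
export I_MPI_MIC=enable

# Pin each MPI task to its own OpenMP domain so its threads stay on local cores
export I_MPI_PIN_DOMAIN=omp

# Two tasks on the host with 8 threads each, one task on the coprocessor
# with 60 threads -- note the asymmetric thread counts per parallel region.
mpirun -host c401-101      -np 2 -env OMP_NUM_THREADS 8  ./lbm.host : \
       -host c401-101-mic0 -np 1 -env OMP_NUM_THREADS 60 ./lbm.mic
```

The colon separates argument sets, which is how a single MPI job spans binaries compiled for different architectures within one communicator.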

Intel MPI with and without InfiniBand proxy. (courtesy Morgan Kaufmann)

Scalability of LBM code with and without proxy enabled for native CPU and native coprocessor execution. (courtesy Morgan Kaufmann)
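A proxy on/off comparison like the one in the figure is typically produced by toggling the Intel MPI fabric and DAPL provider settings between runs. The sketch below is a hedged outline of that procedure: `I_MPI_FABRICS` and `I_MPI_DAPL_PROVIDER` are real Intel MPI variables, but the provider names shown are placeholders; the actual entries live in the system's `/etc/dat.conf` and vary with the InfiniBand and MPSS software stacks.

```shell
# Hedged sketch: timing the same run with and without proxy-based
# communication. Provider names below are placeholders -- consult
# /etc/dat.conf on the target system for the real entries.

export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:dapl

# Baseline: direct InfiniBand provider, no proxy (placeholder name)
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
mpirun -np 16 ./lbm > no_proxy.log

# Proxy-enabled provider, routing coprocessor traffic through the host
# (placeholder name)
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
mpirun -np 16 ./lbm > with_proxy.log
```

Comparing the two logs at increasing task counts yields scaling curves of the kind shown above.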

Chapter Authors

Jerome Vienne

Jerome is a Research Associate at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. His research interests include performance analysis and modeling, high performance computing, high performance networking, benchmarking, and exascale computing.

Carlos Rosales-Fernandez

Carlos is co-director of the Advanced Computing Evaluation Laboratory at the Texas Advanced Computing Center, where his main responsibility is the evaluation of new computer architectures relevant to High Performance Computing. He is the author of the open-source mplabs code.

Kent Milfeld

Kent has been an instructor, scientist, and HPC programmer at the Center for High Performance Computing at UT since its earliest days. His training for the TACC user community focuses on efficient methods of mapping programming paradigms to hardware to obtain the highest possible performance.

Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.
