The chapter authors discuss the hardware heterogeneity found in modern clusters and then analyze a typical Intel Xeon Phi coprocessor-accelerated node on the Stampede cluster at TACC, with an eye toward how MPI is used in similar clusters and how MPI tasks are positioned within the node. The performance of the different communication pathways is highlighted using microbenchmarks. Finally, a hybrid Lattice Boltzmann Method application is configured with optimal MPI options, and scaling performance is evaluated both with and without proxy-based communications.
In the early days of MPI, compute nodes were simple and there was no concern about positioning MPI tasks. The nodes in modern HPC systems have multiple processors, coprocessors, and graphics accelerators. Each coprocessor and graphics device has its own memory, while the processors share memory. On multi-core systems, launching an MPI task on every core is common practice. In systems with a coprocessor, it is more likely that hybrid MPI programs will drive several MPI tasks on the host and a few more, or fewer, on the coprocessor, and that the OpenMP parallel regions will employ different numbers of threads on the host and on the device. So, with hybrid programs running across heterogeneous architectures, asymmetric interfaces, and NUMA architectures, it is understandable that positioning tasks and threads matters more on modern HPC systems.
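To make the positioning problem concrete, here is a minimal sketch of how cores might be divided among MPI tasks on the host and on the coprocessor, so that each task's OpenMP threads stay on their own block of cores. The function name `partition_cores` is hypothetical, and the core and task counts are illustrative assumptions rather than values taken from the chapter. Actual pinning is done through the MPI launcher and OpenMP runtime settings, which are not shown here.

```python
def partition_cores(num_cores, num_tasks):
    """Split cores 0..num_cores-1 into num_tasks contiguous blocks.

    Block sizes differ by at most one core. Each block is the set of
    cores that one MPI task (and its OpenMP threads) would occupy.
    """
    if num_tasks < 1 or num_tasks > num_cores:
        raise ValueError("need between 1 and num_cores tasks")
    base, extra = divmod(num_cores, num_tasks)
    blocks = []
    start = 0
    for rank in range(num_tasks):
        # The first `extra` tasks each take one additional core.
        size = base + (1 if rank < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


# Assumed example: 16 host cores shared by 4 MPI tasks, and
# 60 coprocessor cores shared by 2 MPI tasks (illustrative numbers).
host_layout = partition_cores(16, 4)
mic_layout = partition_cores(60, 2)

for domain, layout in (("host", host_layout), ("coprocessor", mic_layout)):
    for rank, cores in enumerate(layout):
        print(f"{domain} task {rank}: cores {cores[0]}-{cores[-1]} "
              f"({len(cores)} cores)")
```

In this sketch the host tasks each own 4 cores while the coprocessor tasks each own 30, which mirrors the asymmetry described above: the same hybrid program runs with a different task count and a different thread count per task on each side of the node.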
Jerome is a Research Associate of the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. His research interests include Performance Analysis and Modeling, High Performance Computing, High Performance Networking, Benchmarking and Exascale Computing.
Carlos is co-director of the Advanced Computing Evaluation Laboratory at the Texas Advanced Computing Center, where his main responsibility is the evaluation of new computer architectures relevant to High Performance Computing. He is the author of the open source mplabs code.
Kent has been an instructor, scientist and HPC programmer at the Center for High Performance Computing at UT since its earliest days. His expert training for the TACC user community presents efficient methods of mapping programming paradigms to hardware, with the aim of obtaining the highest possible performance.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.