Chapter 12 of High Performance Parallelism Pearls discusses optimizing performance when offloading concurrent kernels (i.e., task parallelism) to the Intel Xeon Phi coprocessor. The authors state, “Our ultimate optimization target in this chapter is to improve the computational throughput of multiple small-scale workloads on the Intel Xeon Phi coprocessor by concurrent kernel offloading.” Concurrent kernel offloading targets application scenarios with many small-scale workloads that individually cannot exploit all the resources of the device. The chapter authors (Florian Wende, Michael Klemm, Thomas Steinke, and Alexander Reinefeld) show how the computational throughput of such workloads can be improved on the coprocessor through concurrent kernel execution using the offload programming model. Each optimization step is elaborated and illustrated with working examples, and performance improvements are presented for two demonstration scenarios: a particle force simulation and dgemm.
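The offload programming model the chapter builds on lets each host thread issue its own offload to the same coprocessor, so several small kernels can execute there concurrently instead of leaving most of the device idle. The sketch below is a minimal illustration of that idea using Intel’s Language Extensions for Offload (LEO) pragmas together with OpenMP; the kernel, array sizes, and thread counts are illustrative assumptions, not the chapter’s code.

```c
/* Hedged sketch of concurrent kernel offloading with Intel LEO + OpenMP.
 * small_kernel(), N_KERNELS, THREADS_PER_KERNEL, and N are illustrative
 * assumptions, not taken from the chapter. Build with the Intel compiler
 * (e.g. icc -qopenmp) on a system with offload support.                   */
#include <stdio.h>
#include <omp.h>

#define N_KERNELS 4            /* number of small-scale workloads          */
#define THREADS_PER_KERNEL 60  /* coprocessor threads used per kernel      */
#define N 1024

__attribute__((target(mic)))   /* compile this function for the coprocessor */
void small_kernel(double *x, int n, int nthreads)
{
    /* Each concurrently offloaded kernel runs on its own OpenMP team on
     * the coprocessor; thread placement/affinity still has to be managed
     * (as the chapter discusses) so the teams do not oversubscribe cores. */
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * x[i] + 1.0;
}

int main(void)
{
    static double data[N_KERNELS][N] = {{0.0}};

    /* One host thread per workload: each host thread issues its own
     * offload, so the kernels can execute concurrently on the device.     */
    #pragma omp parallel num_threads(N_KERNELS)
    {
        int k = omp_get_thread_num();
        double *x = data[k];

        #pragma offload target(mic:0) inout(x : length(N))
        small_kernel(x, N, THREADS_PER_KERNEL);
    }

    printf("done: %f\n", data[0][0]);
    return 0;
}
```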

Performance of a PD simulation using Newton’s 3rd law for the force computation, with and without concurrent offloading. The performance is also compared against execution on a dual-socket Intel Xeon processor (Sandy Bridge) using 32 threads. The dotted horizontal line is the non-concurrent case; the solid line is execution using two thread groups of size 16, each residing on a separate CPU socket. (courtesy Morgan Kaufmann)
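The “thread groups” in the caption refer to partitioning the host threads so that each group is pinned to one CPU socket and drives its own work (or its own offloads). A minimal sketch of forming such groups with nested OpenMP and proc_bind clauses follows; the group count, group size, and the work() placeholder are assumptions for illustration, not the chapter’s setup.

```c
/* Hedged sketch: two host-side thread groups pinned to separate sockets
 * via nested OpenMP. Run with e.g. OMP_PLACES=sockets OMP_PROC_BIND=spread
 * so the proc_bind clauses have places to bind to.                        */
#include <stdio.h>
#include <omp.h>

#define N_GROUPS 2     /* one group per CPU socket (assumption)            */
#define GROUP_SIZE 16  /* threads per group (assumption)                   */

void work(int group, int member)
{
    /* Placeholder for one small-scale workload or one offload request.    */
    printf("group %d, member %d\n", group, member);
}

int main(void)
{
    omp_set_nested(1);             /* allow nested parallel regions        */
    omp_set_max_active_levels(2);

    /* Outer level: one thread per group, spread across sockets.           */
    #pragma omp parallel num_threads(N_GROUPS) proc_bind(spread)
    {
        int group = omp_get_thread_num();

        /* Inner level: the group's members stay close to their socket.    */
        #pragma omp parallel num_threads(GROUP_SIZE) proc_bind(close)
        work(group, omp_get_thread_num());
    }
    return 0;
}
```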
Note that GCC is likely to start supporting Intel Xeon Phi offload semantics in 2015.
Chapter Authors
Florian is a part of the Scalable Algorithms workgroup (department Distributed Algorithms and Supercomputing) at Zuse Institute Berlin (ZIB). He is interested in accelerator and many-core computing with applications in Computer Science and Computational Physics. His focus is on load balancing of irregular parallel computations and on close-to-hardware code optimization. He received a Diploma in Physics (Dipl.-Phys., equivalent to M.Sc.) in 2010 from the Humboldt-Universität zu Berlin, and a B.Sc. in Computer Science in 2013 from the Freie Universität Berlin.
Dr. Michael Klemm is part of Intel’s Software and Services Group, Developer Relations Division. His focus is on High Performance and Throughput Computing. He obtained an M.Sc. in Computer Science and a Doctor of Engineering degree (Dr.-Ing.) in Computer Science from the Friedrich-Alexander-Universität Erlangen-Nürnberg. Michael’s areas of interest include compiler construction, design of programming languages, parallel programming, and performance analysis and tuning. Michael is an Intel representative in the OpenMP Language Committee and leads the efforts to develop error handling features for OpenMP.
Thomas Steinke is head of the Supercomputer Algorithms and Consulting group at the Zuse Institute Berlin (ZIB). He received his Doctorate in Natural Sciences (Dr. rer. nat.) in 1990 from the Humboldt University of Berlin. His research interests are in high-performance computing, heterogeneous systems for scientific and data analytics applications, and parallel simulation methods. Thomas has led the IPCC at ZIB since 2013 and was a co-founder of the OpenFPGA initiative in 2004.
Alexander Reinefeld is the head of the Computer Science Department at Zuse Institute Berlin (ZIB) and a professor at the Humboldt University of Berlin. He received his PhD and MSc from the University of Hamburg in 1987 and 1982, respectively. He has been awarded a PhD scholarship by the German Academic Exchange Service and a Sir Izaak Walton Killam Post Doctoral Fellowship from the University of Alberta, Canada. Alexander has served as an assistant professor at the University of Hamburg and as the managing director at the Paderborn Center for Parallel Computing. He co-founded the North German Supercomputing Alliance (HLRN), the European Grid Forum, the Global Grid Forum, and the German e-Science initiative D-Grid. His research interests are in distributed computing, high-performance computer architecture, scalable and dependable computing, and peer-to-peer algorithms. He has published numerous scientific papers and holds patents on logarithmic routing protocols.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.



