
TechEnablement

Education, Planning, Analysis, Code


Efficient Nested Parallelism On Large Scale Systems

October 28, 2014 by Rob Farber

Choosing the right threading library is critical for application performance, as different threading libraries exhibit significantly different performance behavior, especially on complex computer systems such as the Intel Xeon Phi coprocessor and NUMA Intel Xeon processor machines. Unfortunately, choosing the right threading library is not enough: additional application requirements may not be fulfilled by the default implementation. In our work we present a few approaches to resolving the issue of "memory hungry" applications on the Intel Xeon Phi coprocessor. In addition, the hierarchical arena approach can be used to minimize NUMA-related overheads.


The recent increase in the number of cores in Intel® Xeon® processors and the Intel Xeon Phi coprocessor led us to review the approaches taken in parallel libraries. In general, we have observed that scalability tends to be an issue because the overhead incurred by a threading library tends to grow as more threads must be synchronized while the total amount of data to be processed remains the same. As a result, scalability across processing cores suffers. Some applications are driven by task parallelism, or the ability to run multiple tasks in parallel, as well as data parallelism (the ability to run a single task in parallel across a data set). The combination of task- and data-parallelism provides an opportunity to minimize threading library overheads.

Task arenas within an application (courtesy Morgan Kaufmann)

Chapter Authors

Evgeny Fiksman

Evgeny joined Intel in 2006 and worked on the optimization of video enhancement algorithms for x86 platforms, during which time he acquired expertise in multithreading and low-level programming. For the next five years Evgeny was the lead engineer and architect for the implementation of the OpenCL runtime for Intel CPUs (Core, Xeon, and Atom) and Xeon Phi coprocessors. Recently, Evgeny joined a software enabling team focused on financial applications. Prior to joining Intel, Evgeny led the development of a naval team training simulator. Evgeny holds B.Sc. and M.Sc. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel.

Anton Malakhov

Anton is a software development engineer at Intel SSG who has worked on the Intel® Threading Building Blocks (Intel® TBB) project since 2006. He optimized the Intel TBB task scheduler for the Intel Xeon Phi coprocessor and invented several scheduling algorithms that improved the performance of the Intel OpenCL runtime on the MIC architecture. As a senior development engineer, he is currently responsible for the productization of TBB components such as task_arena and affinity_partitioner. Anton holds the equivalent of an M.S. degree in computer engineering from Omsk State Technical University, Russia (2002).

See the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.


Filed Under: Featured article, Featured news, News, Xeon Phi Tagged With: HPC, Intel, Intel Xeon Phi, x86



© 2026 · techenablement.com