
TechEnablement

Education, Planning, Analysis, Code


Efficient Nested Parallelism On Large Scale Systems

October 28, 2014 by Rob Farber

Choosing the right threading library is critical for application performance, as different threading libraries exhibit significantly different performance behavior, especially on complex computer systems such as the Intel Xeon Phi coprocessor and NUMA Intel Xeon processor machines. Unfortunately, choosing the right threading library is not enough: additional application requirements may not be fulfilled by the default implementation. In our work we present a few approaches to resolving the issue of "memory hungry" applications on the Intel Xeon Phi coprocessor. In addition, the hierarchical arena approach can be used to minimize NUMA-related overheads.


The recent increase in the number of cores in Intel® Xeon® processors and the Intel Xeon Phi coprocessor led us to review the approaches taken in parallel libraries. In general, we have observed that scalability tends to be an issue because the overhead incurred by a threading library tends to grow as more threads must be synchronized while the total amount of data to be processed remains the same. As a result, scalability across processing cores suffers. Some applications are driven by task parallelism, or the ability to run multiple tasks in parallel, as well as data parallelism (the ability to run a single task in parallel across a data set). The combination of task- and data-parallelism provides an opportunity to minimize threading library overheads.

Task arenas within an application (courtesy Morgan Kaufmann)

Chapter Authors

Evgeny Fiksman

Evgeny joined Intel in 2006 and worked on the optimization of video enhancement algorithms for x86 platforms, during which time he acquired expertise in multithreading and low-level programming. For the next five years Evgeny was the lead engineer and architect for the implementation of the OpenCL runtime for Intel CPUs (Core, Xeon, and Atom) and Xeon Phi coprocessors. Recently, Evgeny joined a software enabling team focused on financial applications. Prior to joining Intel, Evgeny led the development of a naval team training simulator. Evgeny holds B.Sc. and M.Sc. degrees in Electrical Engineering from the Technion – Israel Institute of Technology, Haifa, Israel.

Anton Malakhov

Anton is a software development engineer at Intel SSG who has worked on the Intel® Threading Building Blocks (Intel® TBB) project since 2006. He optimized the Intel TBB task scheduler for the Intel Xeon Phi coprocessor and invented several scheduling algorithms that improved the performance of the Intel OpenCL runtime on the MIC architecture. As a senior development engineer, he is currently responsible for the productization of TBB components such as task_arena and affinity_partitioner. Anton holds the equivalent of an M.S. degree in computer engineering from Omsk State Technical University, Russia (2002).

See the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.


Filed Under: Featured article, Featured news, News, Xeon Phi Tagged With: HPC, Intel, Intel Xeon Phi, x86



© 2026 · techenablement.com