
Concurrent Kernel Offloading On Intel Xeon Phi

October 20, 2014 by Rob Farber

Chapter 12 of High Performance Parallelism Pearls discusses optimizing performance when offloading concurrent kernels (i.e. task parallelism) to the Intel Xeon Phi coprocessor. The authors state, “Our ultimate optimization target in this chapter is to improve the computational throughput of multiple small-scale workloads on the Intel Xeon Phi coprocessor by concurrent kernel offloading.” Concurrent kernel offloading targets application scenarios with many small-scale workloads that cannot individually exploit all the resources of the device. The chapter authors (Florian Wende, Michael Klemm, Thomas Steinke, and Alexander Reinefeld) show how concurrent kernel execution through the offload programming model improves throughput in exactly these cases. Each optimization step is elaborated and illustrated with working examples. Performance improvements are presented for two demonstration scenarios: a particle force simulation and dgemm.
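
To make the pattern concrete, here is a minimal sketch of one common way to offload several small kernels concurrently: each host OpenMP thread issues its own offload region, so several kernels can be resident on the coprocessor at the same time. This is an illustration of the general technique, not the chapter’s code; the kernel, the array size, and the choice of four host threads are assumptions made for this example, and it relies on the Intel compiler’s Language Extensions for Offload (compile with, e.g., icc -qopenmp).

#include <stdio.h>
#include <omp.h>

#define N 4096

/* Mark the kernel so it is also compiled for the coprocessor. */
__attribute__((target(mic)))
void scale(double *x, int n, double a)
{
    /* This inner parallel region runs on the coprocessor's cores. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}

int main(void)
{
    static double buf[4][N];   /* one small workload per host thread */

    /* Four host threads each issue an independent offload; with the
     * coprocessor's threads partitioned among them, the four kernels
     * can execute concurrently on the device. */
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        double *x = buf[tid];
        for (int i = 0; i < N; ++i)
            x[i] = tid + 1.0;

        #pragma offload target(mic:0) inout(x : length(N))
        scale(x, N, 2.0);

        printf("host thread %d: x[0] = %.1f\n", tid, x[0]);
    }
    return 0;
}

In practice, how the coprocessor’s hardware threads are partitioned among the concurrent offloads (thread placement and affinity) largely determines whether concurrent offloading actually improves throughput; considerations of this kind are among the optimization steps the chapter elaborates.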

[Figure: Performance of a PD simulation using Newton’s 3rd law for the force computation, with and without concurrent offloading. The performance is additionally compared against execution on a dual-socket Intel Xeon processor (Sandy Bridge) using 32 threads. The dotted horizontal line shows the non-concurrent case; the solid line shows execution using two thread groups of size 16, residing on separate CPU sockets. (courtesy Morgan Kaufmann)]

Note that GCC is likely to start supporting Intel Xeon Phi offload semantics in 2015.

Chapter Authors

Florian Wende

Florian is part of the Scalable Algorithms workgroup (department Distributed Algorithms and Supercomputing) at Zuse Institute Berlin (ZIB). He is interested in accelerator and many-core computing with applications in Computer Science and Computational Physics. His focus is on load balancing of irregular parallel computations and on close-to-hardware code optimization. He received a Diploma in Physics (Dipl.-Phys., equivalent to M.Sc.) in 2010 from the Humboldt-Universität zu Berlin, and a B.Sc. in Computer Science in 2013 from the Freie Universität Berlin.

Michael Klemm

Dr. Michael Klemm is part of Intel’s Software and Services Group, Developer Relations Division. His focus is on High Performance and Throughput Computing. He obtained an M.Sc. in Computer Science and a Doctor of Engineering degree (Dr.-Ing.) in Computer Science from the Friedrich-Alexander-Universität Erlangen-Nürnberg. Michael’s areas of interest include compiler construction, design of programming languages, parallel programming, and performance analysis and tuning. Michael is Intel’s representative in the OpenMP Language Committee and leads the efforts to develop error-handling features for OpenMP.

Thomas Steinke

Thomas Steinke is head of the Supercomputer Algorithms and Consulting group at the Zuse Institute Berlin (ZIB). He received his Doctorate in Natural Sciences (Dr. rer. nat.) in 1990 from the Humboldt University of Berlin. His research interests are in high-performance computing, heterogeneous systems for scientific and data analytics applications, and parallel simulation methods. Thomas has led the IPCC at ZIB since 2013, and he was a co-founder of the OpenFPGA initiative in 2004.

Alexander Reinefeld

Alexander Reinefeld is the head of the Computer Science Department at Zuse Institute Berlin (ZIB) and a professor at the Humboldt University of Berlin. He received his PhD and MSc from the University of Hamburg in 1987 and 1982, respectively. He has been awarded a PhD scholarship by the German Academic Exchange Service and a Sir Izaak Walton Killam Post-Doctoral Fellowship from the University of Alberta, Canada. Alexander has served as an assistant professor at the University of Hamburg and as the managing director of the Paderborn Center for Parallel Computing. He co-founded the North German Supercomputing Alliance (HLRN), the European Grid Forum, the Global Grid Forum, and the German e-Science initiative D-Grid. His research interests are in distributed computing, high-performance computer architecture, scalable and dependable computing, and peer-to-peer algorithms. He has published numerous scientific papers and holds patents on logarithmic routing protocols.

Click to see the overview article “Teaching The World About Intel Xeon Phi”, which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl”, plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.
