
TechEnablement

Education, Planning, Analysis, Code


Optimizing Gather/Scatter Patterns On Intel Xeon Phi

October 14, 2014 by Rob Farber

Many modern microarchitectures rely on single-instruction, multiple-data (SIMD) execution to provide high compute capability in an energy-efficient manner. Such microarchitectures, including those employed by the most recent Intel Xeon processors and Intel Xeon Phi coprocessors, are better suited to contiguous loads and stores than to non-contiguous loads (gathers) and stores (scatters). Gather and scatter behavior is also more complex than that of contiguous loads and stores; for example, performance may depend on how close together the data items being read or written are. While today’s compilers emit gathers and scatters where necessary, they cannot always fully optimize performance, since they do not have enough knowledge about the access pattern. Programmers who perform only traditional memory optimizations, leaving gathers and scatters entirely to the compiler, are likely to leave a significant amount of performance on the table.


Due to their high cost relative to contiguous SIMD loads and stores, and since they cannot always be avoided, the authors discuss four optimization techniques for improving the performance of gather and scatter operations:

  • Using domain knowledge and sorting to improve temporal and spatial locality
  • Choosing an appropriate data layout for hot compute loops, based on their access patterns
  • Minimizing the number of instructions for each individual gather/scatter operation, through on-the-fly transposition between AoS and SoA layouts
  • Amortizing gather/scatter and transposition costs over multiple loop iterations (where possible).

On a representative molecular dynamics application, the authors demonstrate a speedup of approximately 2x between the unoptimized and optimized versions on both Intel Xeon processors and Intel Xeon Phi coprocessors. The optimizations employed are not only the same for both platforms, but are also generally applicable rather than specific to current-generation Intel® microarchitectures. The authors expect them to apply to future products and to other processor designs, so optimizing gather/scatter patterns in applications today should ensure good levels of performance in the future.

Chapter Authors

Simon J. Pennycook

John Pennycook joined Intel as an application engineer in 2014, after graduating from the University of Warwick with a Ph.D. in High Performance Computing. His current work focuses on enabling developers to fully utilize the current generation of Intel® Xeon Phi™ coprocessors. His previous research focused on the optimization of applications (including molecular dynamics codes) for a wide range of different microarchitectures and hardware platforms, as well as the issues surrounding performance portability.

Christopher Hughes

Chris Hughes is a researcher at Intel Labs. He received his Ph.D. degree from the University of Illinois at Urbana-Champaign in 2003. His research interests are emerging workloads and computer architecture. He is currently helping to develop the next generation of microprocessors for compute- and data-intensive applications, focusing on wide SIMD execution and memory systems for processors with many cores. He has published more than 30 papers in a variety of fields such as adaptive CPUs for energy efficiency, gather/scatter hardware, transactional memory, real-time scheduling, dynamic task management, hardware prefetching, physical simulation, and speech recognition.

Mikhail Smelyanskiy

Mikhail Smelyanskiy is a Principal Engineer at Intel’s Parallel Computing Lab, part of Intel Research Labs in Santa Clara, CA. His main focus is application-driven parallel architecture research. Specifically, his work involves the design, implementation, and analysis (including competitive analysis) of parallel algorithms and workloads for current- and future-generation parallel processor systems. His research in medical imaging, computational finance, and more recently fundamental high-performance compute kernels, such as DGEMM (double-precision matrix-matrix multiplication), SpMVM (sparse matrix-vector multiplication), and QCD (quantum chromodynamics), has helped improve Intel architecture as well as demonstrate its full performance potential. Prior to his work at Intel, he earned his Ph.D. from the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor, in 2003, where he worked on hardware/software co-design and compiler optimization.

See the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.

