Many modern microarchitectures rely on single-instruction, multiple-data (SIMD) execution to provide high compute capability in an energy-efficient manner. Such microarchitectures, including those employed by the most recent Intel Xeon processors and Intel Xeon Phi coprocessors, are better suited to contiguous loads and stores than to non-contiguous loads (i.e., gathers) and stores (i.e., scatters). Gather and scatter behavior is more complex than that of contiguous loads and stores; for example, performance may depend on how close together the data items being read or written are. While today's compilers emit gathers and scatters where necessary, they cannot always fully optimize performance, since they lack knowledge of the access pattern. Programmers who perform only traditional memory optimizations, leaving gathers and scatters entirely to the compiler, are likely to leave a significant amount of performance on the table.
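To make the distinction concrete, here is a minimal C sketch (not from the chapter itself; the function names are hypothetical) contrasting a contiguous access pattern, which a vectorizing compiler can turn into wide SIMD loads, with an indexed pattern, which forces it to emit gather instructions or fall back to scalar loads:

```c
#include <stddef.h>

/* Contiguous pattern: consecutive iterations touch consecutive
   addresses, so the compiler can use one wide SIMD load per group
   of elements. */
double sum_contiguous(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Gather pattern: each iteration reads through an index array, so
   the addresses are not known to be contiguous and the compiler
   must emit a gather (or scalar loads) instead. */
double sum_gather(const double *a, const size_t *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[idx[i]];
    return s;
}
```

Both functions compute the same result when `idx` is a permutation of `0..n-1`, but the second typically runs slower on SIMD hardware because the memory system cannot service the lanes with a single contiguous access.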
Because gathers and scatters are expensive relative to contiguous SIMD loads and stores, and because they cannot always be avoided, the authors discuss four optimization techniques for improving their performance:
- Using domain knowledge and sorting to improve temporal and spatial locality
- Choosing an appropriate data layout for hot compute loops, based on their access patterns
- Minimizing the number of instructions for each individual gather/scatter operation, through on-the-fly transposition between AoS and SoA layouts
- Amortizing gather/scatter and transposition costs over multiple loop iterations, where possible
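The second and third techniques above can be illustrated with a small C sketch (an illustration under assumed data structures, not code from the chapter). Position components stored as an Array-of-Structures (AoS) require a strided gather to collect all `x` values, whereas a Structure-of-Arrays (SoA) layout makes each component contiguous; transposing once, outside the hot loop, amortizes the conversion cost:

```c
#include <stddef.h>

/* Array-of-Structures (AoS): x, y, z of one particle are adjacent
   in memory, so a loop over all x values is a strided gather. */
typedef struct { double x, y, z; } ParticleAoS;

/* Structure-of-Arrays (SoA): all x values are contiguous, so a hot
   loop over x compiles to contiguous SIMD loads. */
typedef struct {
    double *x, *y, *z;
} ParticlesSoA;

/* One-off AoS-to-SoA transposition, done once before the hot loop
   so its cost is amortized over many iterations of that loop. */
void aos_to_soa(const ParticleAoS *in, ParticlesSoA *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Hot loop on the SoA layout: each component array is accessed
   contiguously, which is SIMD-friendly. */
double sum_x(const ParticlesSoA *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];
    return s;
}
```

The design point is that the layout choice should follow the access pattern of the hot loop: if the loop consumes one component at a time across many particles, SoA wins; if it consumes all components of one particle together, AoS may be the better fit.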
On a representative molecular dynamics application, the authors demonstrate a speedup of approximately 2x over the unoptimized code on both Intel Xeon processors and Intel Xeon Phi coprocessors. The optimizations employed are not only the same for both platforms, but are also generally applicable rather than specific to current-generation Intel® microarchitectures. The authors expect them to apply to future products and to other processor designs; optimizing gather/scatter patterns in applications today should therefore ensure good levels of performance in the future.
John Pennycook joined Intel as an application engineer in 2014, after graduating from the University of Warwick with a Ph.D. in High Performance Computing. His current work focuses on enabling developers to fully utilize the current generation of Intel® Xeon Phi™ coprocessors. His previous research focused on the optimization of applications (including molecular dynamics codes) for a wide range of different microarchitectures and hardware platforms, as well as the issues surrounding performance portability.
Chris Hughes is a researcher at Intel Labs. He received his Ph.D. degree from the University of Illinois at Urbana-Champaign in 2003. His research interests are emerging workloads and computer architecture. He is currently helping to develop the next generation of microprocessors for compute- and data-intensive applications, focusing on wide SIMD execution and memory systems for processors with many cores. He has published more than 30 papers in a variety of fields such as adaptive CPUs for energy efficiency, gather/scatter hardware, transactional memory, real-time scheduling, dynamic task management, hardware prefetching, physical simulation, and speech recognition.
Mikhail Smelyanskiy is a Principal Engineer at Intel's Parallel Computing Lab, part of Intel Research Labs in Santa Clara, CA. His main focus is on application-driven parallel architecture research. Specifically, his work involves the design, implementation, and analysis (including competitive analysis) of parallel algorithms and workloads for current- and future-generation parallel processor systems. His research in the areas of medical imaging, computational finance, and, more recently, fundamental high-performance compute kernels, such as DGEMM (double-precision matrix-matrix multiplication), SpMVM (sparse matrix-vector multiplication), and QCD (quantum chromodynamics), has helped improve Intel architecture, as well as demonstrate its full performance potential. Prior to his work at Intel, he earned a Ph.D. from the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor in 2003, where he worked on hardware/software co-design and compiler optimization.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.