This chapter discusses the characterization and optimization methodology applied to a 3D finite difference (3DFD) algorithm used to solve the constant or variable density isotropic acoustic wave equation (Iso3DFD). Going from an unoptimized version to the most optimized one, the authors achieved a six-fold performance improvement on Intel Xeon E5-2697v2 processors and a nearly thirty-fold improvement on Intel Xeon Phi coprocessors.
In addition to the discussion of performance improvement techniques, the most important takeaway from this chapter is the three-step methodology for ensuring high efficiency:
- Estimate the best achievable performance, even before starting tuning
- Tune the code for parallelism, data locality and vectorization
- Auto-tune the best set of parameters for both the build and the run-time phases
Starting from the most basic implementation of the 3DFD, the authors describe a methodology for estimating the best performance an algorithm can achieve, based on a characterization of both the algorithm and the hardware.
The roofline plots show the upper theoretical limit and the achievable limit of the platform, respectively. Horizontal lines represent the maximum achievable peaks when considering the (#ADD, #MUL) imbalance and when weighted by the Stream triad bandwidth. The vertical line represents the arithmetic intensity of the Iso3DFD kernel. Its intersections with the other lines give the corresponding achievable limits.
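The roofline estimate described above can be sketched in a few lines. This is a minimal illustration, not the authors' model: the platform figures, the kernel's operation counts, and the function names are all hypothetical, and the balance factor assumes that peak performance is quoted for perfectly paired add and multiply instructions.

```python
def fma_balance_factor(n_add: int, n_mul: int) -> float:
    """Fraction of peak reachable when adds and multiplies cannot all be paired.

    Assumption: peak is quoted for perfectly paired add/multiply issue,
    so the larger of the two counts sets the number of issue slots needed.
    """
    return (n_add + n_mul) / (2.0 * max(n_add, n_mul))


def roofline_gflops(peak_gflops: float, bandwidth_gbs: float,
                    flops_per_byte: float, balance: float = 1.0) -> float:
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    compute_roof = peak_gflops * balance           # horizontal line
    memory_roof = bandwidth_gbs * flops_per_byte   # bandwidth-limited slope
    return min(compute_roof, memory_roof)


# Hypothetical platform and kernel figures, for illustration only.
peak = 1000.0       # GFLOP/s, theoretical peak
stream_bw = 100.0   # GB/s, as measured by a STREAM-triad-style benchmark
intensity = 2.0     # flop/byte, arithmetic intensity of the kernel
balance = fma_balance_factor(n_add=30, n_mul=10)

estimate = roofline_gflops(peak, stream_bw, intensity, balance)
print(f"balance factor: {balance:.3f}, attainable: {estimate:.1f} GFLOP/s")
```

With these made-up numbers the kernel is memory bound: the bandwidth roof lies well below the imbalance-adjusted compute roof, so the estimate is set by bandwidth times arithmetic intensity.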
To obtain performance close to this expected value, the authors discuss a series of tuning steps that range from basic changes to an implementation using hardware intrinsic functions.
The tuning techniques described include:
- Scalable parallelization (collaborative thread blocking)
- Maximizing memory bandwidth (cache blocking, register reuse)
- Maximizing in-core performance (vectorization, loop-redistribution)
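As an illustration of the cache blocking idea in the list above, the sketch below tiles a 3D stencil sweep into blocks of `bz` x `by` x `bx` points. It is a simplified stand-in, not the chapter's kernel: it assumes a 7-point stencil rather than the higher-order one a production Iso3DFD code would use, the function names are hypothetical, and a NumPy version only shows the loop structure. The actual benefit comes from compiled code where each tile stays resident in cache.

```python
import numpy as np


def stencil_reference(u: np.ndarray) -> np.ndarray:
    """7-point stencil over the interior, computed in a single sweep."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1, 1:-1] = (
        -6.0 * u[1:-1, 1:-1, 1:-1]
        + u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
        + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
        + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]
    )
    return out


def stencil_blocked(u: np.ndarray, bz: int, by: int, bx: int) -> np.ndarray:
    """Same stencil, but the interior is visited one (bz, by, bx) tile at a time."""
    nz, ny, nx = u.shape
    out = np.zeros_like(u)
    for z0 in range(1, nz - 1, bz):
        z1 = min(z0 + bz, nz - 1)          # clip the last tile to the interior
        for y0 in range(1, ny - 1, by):
            y1 = min(y0 + by, ny - 1)
            for x0 in range(1, nx - 1, bx):
                x1 = min(x0 + bx, nx - 1)
                out[z0:z1, y0:y1, x0:x1] = (
                    -6.0 * u[z0:z1, y0:y1, x0:x1]
                    + u[z0 + 1:z1 + 1, y0:y1, x0:x1] + u[z0 - 1:z1 - 1, y0:y1, x0:x1]
                    + u[z0:z1, y0 + 1:y1 + 1, x0:x1] + u[z0:z1, y0 - 1:y1 - 1, x0:x1]
                    + u[z0:z1, y0:y1, x0 + 1:x1 + 1] + u[z0:z1, y0:y1, x0 - 1:x1 - 1]
                )
    return out


# Small random grid; the block sizes are arbitrary and need not divide the grid.
grid = np.random.default_rng(0).random((10, 12, 14))
same = np.allclose(stencil_blocked(grid, 4, 5, 16), stencil_reference(grid))
print("blocked sweep matches reference:", same)
```

The block sizes are exactly the kind of parameter that the auto-tuning step is meant to choose, since the best values depend on the cache hierarchy and the grid dimensions.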
An automatic tuning method is used to find the optimal set of parameters required at application build time, at execution time, or both. These parameters usually come from source code changes (e.g., loop blocking values), compiler-driven options (e.g., loop unrolling factor), and hardware characteristics (e.g., cache sizes).
The chapter also implements a genetic algorithm to search the space of available parameters, including cache blocking sizes, domain decomposition shapes, prefetching flags and power consumption. The resulting tuning method is considerably faster than traditional exhaustive search techniques. In addition to improving performance, the automatic tuning methodology selects an optimal parameter set for any input workload.
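A genetic search over tuning parameters can be sketched as follows. This is a generic, minimal genetic algorithm written for illustration, not the authors' implementation: the candidate block sizes, the `synthetic_cost` function, and all other names are hypothetical. In a real tuner the cost would be a measured run time (and would be cached, since the sketch evaluates it repeatedly while sorting).

```python
import random

# Hypothetical candidate values for each of three block sizes.
BLOCK_CHOICES = [1, 2, 4, 8, 16, 32, 64, 128]


def synthetic_cost(params):
    """Stand-in for a measured run time (made up for this sketch; lower is better)."""
    bx, by, bz = params
    return abs(bx - 64) + abs(by - 8) + abs(bz - 4)


def genetic_search(cost, choices, n_genes=3, pop_size=12, generations=30, seed=0):
    """Evolve parameter tuples: keep the best half, refill by crossover and mutation."""
    rng = random.Random(seed)
    population = [tuple(rng.choice(choices) for _ in range(n_genes))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]       # elitism: survivors carry over
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)         # single-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.3:                  # occasional random mutation
                child[rng.randrange(n_genes)] = rng.choice(choices)
            children.append(tuple(child))
        population = parents + children
    return min(population, key=cost)


best = genetic_search(synthetic_cost, BLOCK_CHOICES)
print("best parameters:", best, "cost:", synthetic_cost(best))
```

Each generation here evaluates only 12 candidates, whereas an exhaustive search over three parameters with eight values each would need 512 evaluations. That gap widens quickly as more parameters are added, which is the motivation for a guided search.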
Chapter Authors
Cédric is an application engineer in the Energy team at Intel Corporation. He helps optimize applications running on Intel platforms for the Oil and Gas industry.
Leo is a Senior Staff Engineer and has been engaged with the Intel Many Integrated Core program from its early days. He specializes in HPC applying his background in numerical analysis and in developing parallel numerical math libraries. Leo is focused on optimization work related to the Oil & Gas industry.
Philippe leads the Intel Energy Engineering Team supporting end users and software vendors in the energy sector. His work includes profiling and tuning HPC applications for current and future platforms, as well as defining supercomputers with respect to application behavior. His research activity is devoted to performance extrapolation and to application characterization and modeling toward Exascale computing.
Gregg specializes in porting and optimizing science and engineering applications on parallel computers. Gregg joined Intel Corporation Software and Services Group in 2011.
Chuck is a Principal Engineer in the Software and Services Group at Intel Corporation, where he has been employed since 1995. He has professionally contributed in the areas of computer performance analysis and optimization (pre-Si and post-Si), object-oriented software design, machine learning, and computer architecture.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.