By exploiting the strengths of the Intel Xeon Phi coprocessor, the authors of chapter 3 of High Performance Parallelism Pearls were able to improve and modernize their code and “achieve great scaling, vectorization, bandwidth utilization and performance/watt”. The authors (Jacob Weismann Poulsen, Karthik Raman and Per Berg) note, “The thinking process and techniques used in this chapter have wide applicability: focus on data locality and then apply threading and vectorization techniques.” In particular, they write about the advection routine from the HIROMB-BOOS-Model (HBM), which initially underperformed on the Intel Xeon Phi coprocessor. After restructuring the code, they achieved a 3x performance improvement. The restructuring involved changing data structures for better data locality and exploiting the available threads and SIMD lanes for better concurrency at the thread and loop level, so that the code can use the maximum available memory bandwidth. To avoid data licensing issues, the example code provided in High Performance Parallelism Pearls uses the Baffin Bay setup generated from the freely available ETOPO2 data set.
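As a rough illustration of the "data locality first, then threading" idea, the sketch below splits a flat index range into contiguous, near-equal blocks, one per thread. Each thread then touches one contiguous region of memory, and an inner loop over contiguous indices is the kind of loop a compiler can usually vectorize. This is not the decomposition used in the chapter: the function name `block_partition`, the flat 1D layout, and the problem size are all assumptions made for illustration. Only the thread counts (48 and 240) come from the text.

```python
def block_partition(n_points, n_threads):
    """Split range(n_points) into n_threads contiguous, near-equal blocks.

    Returns a list of half-open (start, stop) index pairs, one per thread.
    Hypothetical helper for illustration; not taken from the HBM code.
    """
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")
    base, extra = divmod(n_points, n_threads)
    blocks = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads get one additional point each,
        # so block sizes differ by at most one.
        size = base + (1 if t < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    n_points = 1_000_000  # made-up problem size, not from the chapter
    for n_threads in (48, 240):  # thread counts quoted in the text
        blocks = block_partition(n_points, n_threads)
        sizes = [stop - start for start, stop in blocks]
        print(n_threads, "threads: block sizes from", min(sizes), "to", max(sizes))
```

In a compiled OpenMP code, a static loop schedule gives a similar contiguous split. The point of the sketch is only that the data layout is chosen first, so that the per-thread work maps onto contiguous memory.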
A very nice feature is the near-ideal scaling of the advection routine in the example code as seen in the figure below:
It is always important to state the node performance, either in memory bandwidth (if the code is memory bound) or in flops (if it is compute bound), when showing node scaling plots, so that an inefficient implementation does not appear better than an efficient one. For this reason, the authors asked that I point out the following about the previous scaling plot:
- The HBM advection code attains 100% of peak bandwidth (BW) on a two-socket Ivy Bridge node (2S IVB) and on a two-socket Haswell node (2S HSW), and it attains 90% of peak BW on a Knights Corner (KNC) coprocessor card (here, peak is defined as the performance attained by the STREAM triad benchmark).
- More specifically, when the authors cross-compare 2S IVB (48 threads) with 1 KNC card (240 threads), they use the same algorithm on both architectures (e.g. they run with 1 MPI task using 48 OpenMP threads and 240 OpenMP threads, respectively).
- It should be stressed that the time to solution on IVB (or HSW) can be improved by using more MPI tasks and fewer threads (to get better data placement on the pages), which reduces the time to solution through better use of the bandwidth even though the sustained bandwidth cannot be improved.
- However, cross-comparing N tasks each using M threads on 24 IVB cores with 1 task and 240 threads on 60 KNC cores is really cross-comparing slightly different algorithms. For clarity, the authors avoided an apples-to-oranges comparison in their chapter. In reviewing this post, they noted that if one chooses to compare the slightly different implementations (after all, time to solution is all that users care about), then it is very important to distinguish between performance that is due to the hardware and performance that is due to the algorithms.
- Finally, it should be mentioned that 1 KNC card outperforms 2S IVB in the pure threaded version of the run, where performance is measured as time to solution.
(It is always a joy to work with researchers who perform careful comparative work! [Ed])
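The first point in the list above defines peak bandwidth as what the STREAM triad benchmark attains. The arithmetic behind a "percent of peak" figure can be sketched as follows. The triad kernel is `a[i] = b[i] + s * c[i]`, and STREAM counts two loads and one store per element. The function names and every number in the example are placeholders chosen for illustration. They are not measurements from the chapter.

```python
def triad_bytes(n_elements, bytes_per_word=8):
    """Bytes counted for one STREAM triad sweep, a[i] = b[i] + s * c[i].

    STREAM counts two loads (b, c) and one store (a) per element.
    """
    return 3 * n_elements * bytes_per_word


def bandwidth_gb_per_s(n_bytes, seconds):
    """Sustained bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_bytes / seconds / 1e9


def fraction_of_peak(kernel_gb_per_s, triad_gb_per_s):
    """Kernel bandwidth relative to the triad bandwidth on the same node."""
    return kernel_gb_per_s / triad_gb_per_s


if __name__ == "__main__":
    # Placeholder inputs, NOT measured values:
    n = 100_000_000          # doubles per array
    triad_seconds = 0.03     # hypothetical time for one triad sweep
    kernel_gb_per_s = 72.0   # hypothetical bandwidth of the kernel of interest

    triad = bandwidth_gb_per_s(triad_bytes(n), triad_seconds)
    print("triad GB/s:", triad)
    print("fraction of peak:", fraction_of_peak(kernel_gb_per_s, triad))
```

Using a measured triad figure rather than the hardware's theoretical bandwidth as the reference makes the percentage reflect what the node can actually sustain.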
Jacob Weismann Poulsen is educated in computer science and mathematics from The University of Copenhagen and has worked since the early 2000s as an HPC and scientific programming consultant for the research departments at DMI. He has expertise in analyzing and optimizing applications within the meteorological field and is an accomplished communicator of subjects within parallel programming.
Per Berg is educated in mathematical modelling and scientific computing and has been developing modelling software since the mid-80s, first for applications in seismic exploration and, from 1993, for water environments (estuaries, ocean). Working for both private companies and public institutes, Per has been involved in numerous projects that apply models to solve engineering and scientific problems.
Karthik Raman is a Software Architect at Intel focusing primarily on pre/post-silicon performance analysis and optimization of HPC workloads on Intel MIC (Many Integrated Cores) architecture. His expertise includes analyzing for optimal compiler code generation, vectorization and assessing key architectural features for performance. He is also involved in delivering transformative methods and tools to expose new opportunities and insights for MIC competitive differentiation.
Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.