Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors

February 2, 2015 by Rob Farber

Andrey Vladimirov at Colfax International has posted source code and a paper, “Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices,” on the Colfax site. Andrey notes, “Benchmarks show that the discussed optimizations improve the application performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.” He uses the Doolittle algorithm of LU decomposition, which is commonly used to solve systems of linear algebraic equations. When the coprocessor merely matches the performance of an Intel Xeon processor, the value of the Intel Xeon Phi lies in its better energy efficiency.

Image courtesy Colfax International
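For readers unfamiliar with the algorithm, below is a minimal, unoptimized sketch of Doolittle LU decomposition (in-place, row-major, no pivoting). The function name and layout are illustrative assumptions; they are not taken from the Colfax source code.

#include <cstdio>

// In-place Doolittle LU decomposition of an n-by-n row-major matrix A.
// On exit, the upper triangle (including the diagonal) holds U and the
// strict lower triangle holds L (L has an implicit unit diagonal).
// No pivoting is performed, so A must not produce a zero pivot.
void lu_doolittle(double* A, int n) {
  for (int k = 0; k < n; ++k) {
    for (int i = k + 1; i < n; ++i) {
      A[i*n + k] /= A[k*n + k];          // multiplier l_ik
      for (int j = k + 1; j < n; ++j)    // update the trailing block of row i
        A[i*n + j] -= A[i*n + k] * A[k*n + j];
    }
  }
}

int main() {
  double A[9] = {4, 3, 0,
                 8, 9, 1,
                 0, 3, 5};
  lu_doolittle(A, 3);
  for (int i = 0; i < 3; ++i)
    std::printf("%6.2f %6.2f %6.2f\n", A[i*3], A[i*3 + 1], A[i*3 + 2]);
  return 0;
}

The paper's actual code adds the vectorization and memory-traffic optimizations that produce the speedups quoted above; this sketch only fixes the arithmetic being optimized.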

The paper draws two conclusions:

  1. On the host’s Intel Xeon CPU, the high-level language code performs on par with the industry-leading MKL implementation at our target matrix size of 128×128. For smaller matrices, the code is actually more efficient than MKL; however, for larger matrices, MKL performs better.
    • Indeed, optimization techniques such as loop regularization are only important for short loops; for longer loops, other strategies may be used, such as tiling the j-loop or multiple loops (see the sketch after this list).
  2. On the Intel Xeon Phi coprocessor, the MKL implementation loses by a large factor both to the MKL code on the CPU and to our high-level language code on the CPU and coprocessor. This indicates that the MKL code, likely hand-tuned with explicit assembly or intrinsics, is not portable to the MIC architecture, while our high-level language approach, tuned for the CPU, also delivers high performance on the coprocessor.
    • While the MKL developers have yet to approach optimizing ?getrf() for Intel Xeon Phi coprocessors, we developed for two platforms for almost the effort of one.
    • The word “almost” is used because we did tune the sizes of tiles and the order of loops separately for the CPU and MIC; however, even without this fine-tuning the loss of performance on either platform is relatively small. The reader can verify this using the code supplied with this paper.
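To make the tiling idea from conclusion 1 concrete, here is a hypothetical tiled variant of the trailing-matrix update in the sketch above. The tile width, loop order, and structure are illustrative assumptions, not the paper's actual code; as the authors note, they tuned tile sizes and loop order separately for the CPU and for MIC.

// Tiled variant of the Doolittle trailing-matrix update: the j-loop is
// blocked so that each TILE-wide strip of the rows being updated stays
// resident in cache and maps onto full vector registers.
const int TILE = 16;  // illustrative tile width; the paper tunes this per platform

void lu_doolittle_tiled(double* A, int n) {
  for (int k = 0; k < n; ++k) {
    for (int i = k + 1; i < n; ++i)
      A[i*n + k] /= A[k*n + k];                  // compute all multipliers first
    for (int jj = k + 1; jj < n; jj += TILE) {   // tile the j-loop
      const int jmax = (jj + TILE < n) ? jj + TILE : n;
      for (int i = k + 1; i < n; ++i)
        for (int j = jj; j < jmax; ++j)
          A[i*n + j] -= A[i*n + k] * A[k*n + j];
    }
  }
}

The arithmetic is identical to the untiled version; only the traversal order changes, which is why tiling pays off mainly once the loops are long enough for cache behavior to dominate.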

Note that the latest release of MAGMA for Intel Xeon Phi adds LU decomposition along with other dense matrix factorizations and eigenproblem solvers.
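The ?getrf() named in the conclusions above is the LAPACK LU-factorization family (sgetrf, dgetrf, cgetrf, zgetrf) that both MKL and MAGMA implement. As a point of reference, a minimal host-side call through MKL's standard LAPACKE interface looks roughly like this (the header name and link line depend on your MKL installation):

#include <cstdio>
#include <mkl_lapacke.h>  // or <lapacke.h> with a reference LAPACK build

int main() {
  const lapack_int n = 3;
  double A[9] = {4, 3, 0,   // row-major 3x3 matrix, factored in place
                 8, 9, 1,
                 0, 3, 5};
  lapack_int ipiv[3];       // pivot indices from partial pivoting

  // dgetrf computes A = P*L*U in place; a nonzero return flags a
  // singular pivot (info > 0) or a bad argument (info < 0).
  lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, n, ipiv);
  std::printf("dgetrf returned %d\n", (int)info);
  return 0;
}

Unlike the Doolittle sketches above, ?getrf() performs partial pivoting, which is why it is the robust choice for general matrices even when a hand-rolled kernel wins on small fixed sizes.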

Image courtesy ICL
