The paper draws the following conclusions:
- On the host's Intel Xeon CPU, the high-level language code performs on par with the industry-leading MKL implementation at our target matrix size of 128×128. For smaller matrices, the code is actually more efficient than MKL; for larger matrices, MKL performs better.
- Indeed, optimization techniques such as loop regularization matter only for short loops; for longer loops, other strategies may be used, such as tiling the j-loop or tiling multiple loops.
- On the Intel Xeon Phi coprocessor, the MKL implementation loses by a large factor both to the MKL code on the CPU and to our high-level language code on the CPU and the coprocessor. This indicates that the MKL code, likely hand-tuned with explicit assembly or intrinsics, does not port well to the MIC architecture, whereas our high-level language approach, tuned for the CPU, also delivers high performance on the coprocessor.
- While the MKL developers have yet to optimize ?getrf() for Intel Xeon Phi coprocessors, we developed code for two platforms for almost the effort of one.
- The word “almost” is used because we did tune the tile sizes and the loop order separately for the CPU and the MIC; however, even without this fine-tuning, the loss of performance on either platform is relatively small. The reader can verify this using the code supplied with this paper.
Note that the latest release of MAGMA for Intel Xeon Phi adds LU decomposition along with dense matrix factorizations and eigenproblem solvers.