Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors

February 2, 2015 by Rob Farber

Andrey Vladimirov at Colfax International has posted source code and a paper, “Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices,” on the Colfax site. Andrey notes, “Benchmarks show that the discussed optimizations improve the application performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.” He uses the Doolittle algorithm of LU decomposition, which is commonly used to solve systems of linear algebraic equations. When the coprocessor merely matches the performance of an Intel Xeon processor, the value of the Intel Xeon Phi lies in its better energy efficiency.

Image courtesy Colfax International
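For readers unfamiliar with the algorithm, below is a minimal, unoptimized sketch of Doolittle LU decomposition (in-place, row-major, no pivoting). The function name and layout are illustrative assumptions; they are not taken from the Colfax source code.

#include <cstdio>

// In-place Doolittle LU decomposition of an n-by-n row-major matrix A.
// On exit, the upper triangle (including the diagonal) holds U and the
// strict lower triangle holds L (L has an implicit unit diagonal).
// No pivoting is performed, so A must not produce a zero pivot.
void lu_doolittle(double* A, int n) {
  for (int k = 0; k < n; ++k) {
    for (int i = k + 1; i < n; ++i) {
      A[i*n + k] /= A[k*n + k];          // multiplier l_ik
      for (int j = k + 1; j < n; ++j)    // update the trailing block of row i
        A[i*n + j] -= A[i*n + k] * A[k*n + j];
    }
  }
}

int main() {
  double A[9] = {4, 3, 0,
                 8, 9, 1,
                 0, 3, 5};
  lu_doolittle(A, 3);
  for (int i = 0; i < 3; ++i)
    std::printf("%6.2f %6.2f %6.2f\n", A[i*3], A[i*3 + 1], A[i*3 + 2]);
  return 0;
}

The paper's actual code adds the vectorization and memory-traffic optimizations that produce the speedups quoted above; this sketch only fixes the arithmetic being optimized.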

The paper draws two conclusions:

  1. On the host’s Intel Xeon CPU, the high-level language code performs on par with the industry-leading MKL implementation at our target matrix size of 128×128. For smaller matrices, the code is actually more efficient than MKL; however, for larger matrices, MKL performs better.
    • Indeed, optimization techniques such as loop regularization are only important for short loops; for longer loops, other strategies may be used, such as tiling the j-loop or multiple loops (see the sketch after this list).
  2. On the Intel Xeon Phi coprocessor, the MKL implementation loses by a large factor both to the MKL code on the CPU and to our high-level language code on the CPU and coprocessor. This indicates that the MKL code, likely hand-tuned with explicit assembly or intrinsics, is not portable to the MIC architecture, while our high-level language approach, tuned for the CPU, also delivers high performance on the coprocessor.
    • While the MKL developers have yet to approach optimizing ?getrf() for Intel Xeon Phi coprocessors, we developed for two platforms for almost the effort of one.
    • The word “almost” is used because we did tune the sizes of tiles and the order of loops separately for the CPU and MIC; however, even without this fine-tuning the loss of performance on either platform is relatively small. The reader can verify this using the code supplied with this paper.
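To make the tiling idea from conclusion 1 concrete, here is a hypothetical tiled variant of the trailing-matrix update in the sketch above. The tile width, loop order, and structure are illustrative assumptions, not the paper's actual code; as the authors note, they tuned tile sizes and loop order separately for the CPU and for MIC.

// Tiled variant of the Doolittle trailing-matrix update: the j-loop is
// blocked so that each TILE-wide strip of the rows being updated stays
// resident in cache and maps onto full vector registers.
const int TILE = 16;  // illustrative tile width; the paper tunes this per platform

void lu_doolittle_tiled(double* A, int n) {
  for (int k = 0; k < n; ++k) {
    for (int i = k + 1; i < n; ++i)
      A[i*n + k] /= A[k*n + k];                  // compute all multipliers first
    for (int jj = k + 1; jj < n; jj += TILE) {   // tile the j-loop
      const int jmax = (jj + TILE < n) ? jj + TILE : n;
      for (int i = k + 1; i < n; ++i)
        for (int j = jj; j < jmax; ++j)
          A[i*n + j] -= A[i*n + k] * A[k*n + j];
    }
  }
}

The arithmetic is identical to the untiled version; only the traversal order changes, which is why tiling pays off mainly once the loops are long enough for cache behavior to dominate.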

Note that the latest release of MAGMA for Intel Xeon Phi adds LU decomposition along with other dense matrix factorizations and eigenproblem solvers.
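The ?getrf() named in the conclusions above is the LAPACK LU-factorization family (sgetrf, dgetrf, cgetrf, zgetrf) that both MKL and MAGMA implement. As a point of reference, a minimal host-side call through MKL's standard LAPACKE interface looks roughly like this (the header name and link line depend on your MKL installation):

#include <cstdio>
#include <mkl_lapacke.h>  // or <lapacke.h> with a reference LAPACK build

int main() {
  const lapack_int n = 3;
  double A[9] = {4, 3, 0,   // row-major 3x3 matrix, factored in place
                 8, 9, 1,
                 0, 3, 5};
  lapack_int ipiv[3];       // pivot indices from partial pivoting

  // dgetrf computes A = P*L*U in place; a nonzero return flags a
  // singular pivot (info > 0) or a bad argument (info < 0).
  lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, n, ipiv);
  std::printf("dgetrf returned %d\n", (int)info);
  return 0;
}

Unlike the Doolittle sketches above, ?getrf() performs partial pivoting, which is why it is the robust choice for general matrices even when a hand-rolled kernel wins on small fixed sizes.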

Image courtesy ICL
