TechEnablement

Education, Planning, Analysis, Code


Sparse matrix-vector multiplication: parallelization and vectorization

November 10, 2014 by Rob Farber Leave a Comment

The chapter authors (Albert-Jan N. Yzelman, Dirk Roose, and Karl Meerbergen) note that, “Current hardware trends lead to an increasing width of vector units as well as to decreasing effective bandwidth-per-core. For sparse computations these two trends conflict.” For this reason they designed a usable and efficient data structure for vectorized sparse computations on multi-core architectures with vector-processing capabilities – like Intel Xeon Phi. This data structure addresses the difficulty of achieving high performance in sparse matrix–vector (SpMV) multiplication, which stems from a low flop-to-byte ratio and inefficient cache use. Results are presented for both sparse matrix multiplication and its transpose.


The final vectorized BICRS data structure separates the relative encoding of the pq − 1 nonzeroes in each block from the BICRS encoding of the leading nonzero of each block:

  • These two encodings are then combined in the final vectorized data structure.
  • The figure illustrates 2×2 blocking on a 4×4 matrix with 10 nonzeroes ordered according to a Hilbert curve.
  • This results in 4 blocks, containing 6 explicit zeroes.
Illustration of the final vectorized BICRS data structure. (Courtesy Morgan Kaufmann)

The following figures show performance improvements for sparse matrix–vector multiplication on Intel Xeon Phi and Xeon using the optimizations discussed in this chapter. The baseline is OpenMP CRS, successively augmented with:

  • Partial data distribution
  • Sparse blocking with Hilbert ordering
  • Vectorized BICRS data structure
[Figure: SpMV performance results on Intel Xeon Phi. Courtesy Morgan Kaufmann]

[Figure: SpMV performance results on Intel Xeon. Courtesy Morgan Kaufmann]

The following figures show performance results for sparse matrix transpose on Intel Xeon Phi and Xeon. The performance of each possible blocking size (1×4, 2×2, and 4×1) is compared against results obtained using nonvectorized BICRS (1×1).

[Figure: sparse matrix transpose performance results on Intel Xeon Phi. Courtesy Morgan Kaufmann]

[Figure: sparse matrix transpose performance results on Intel Xeon. Courtesy Morgan Kaufmann]

Chapter Authors

Albert-Jan N. Yzelman

Albert-Jan is a postdoctoral researcher at the department of Computer Science, KU Leuven, Belgium. He works within the ExaScience Life Lab on sparse matrix computations, high performance computing, and general parallel programming.

Dirk Roose

Dirk is a professor at the department of Computer Science, KU Leuven, Belgium. His research focuses on numerical methods for computational science and engineering and on algorithms for parallel scientific computing.

Karl Meerbergen

Karl is a professor at the department of Computer Science, KU Leuven, Belgium. His research focuses on large scale numerical linear algebra.

Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.


Filed Under: Featured article, Featured news, News, Xeon Phi Tagged With: HPC, Intel, Intel Xeon Phi, x86
