
Characterization And Optimization Methodology Applied To Stencil Computations

November 4, 2014 by Rob Farber

This chapter discusses a characterization and optimization methodology applied to a 3D finite-difference (3DFD) algorithm used to solve the constant- or variable-density isotropic acoustic wave equation (Iso3DFD). From the unoptimized version to the most optimized, the authors achieved a six-fold performance improvement on Intel Xeon E5-2697v2 processors and a nearly thirty-fold improvement on Intel Xeon Phi coprocessors.
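For readers new to this kind of kernel, the following is a minimal, unoptimized C sketch of a second-order-in-time, eighth-order-in-space constant-density update. The function and array names, the HALF_LENGTH radius, and the coefficient handling are illustrative assumptions, not the chapter's actual code:

```c
#include <stddef.h>

#define HALF_LENGTH 4  /* 8th-order spatial stencil radius (assumed) */

/* One time step of the constant-density isotropic acoustic wave equation:
 *   next = 2*cur - prev + vel^2 * dt^2 * laplacian(cur)
 * coeff[0..HALF_LENGTH] holds the finite-difference weights (assumed given). */
void iso3dfd_step(float *next, const float *cur, const float *prev,
                  const float *vel2dt2,          /* vel^2 * dt^2 per cell */
                  const float *coeff,
                  size_t nx, size_t ny, size_t nz)
{
    const size_t sx = 1, sy = nx, sz = nx * ny;   /* strides in x, y, z */

    for (size_t z = HALF_LENGTH; z < nz - HALF_LENGTH; ++z)
        for (size_t y = HALF_LENGTH; y < ny - HALF_LENGTH; ++y)
            for (size_t x = HALF_LENGTH; x < nx - HALF_LENGTH; ++x) {
                size_t i = z * sz + y * sy + x * sx;
                float lap = coeff[0] * cur[i];
                for (int r = 1; r <= HALF_LENGTH; ++r)
                    lap += coeff[r] * (cur[i + r * sx] + cur[i - r * sx] +
                                       cur[i + r * sy] + cur[i - r * sy] +
                                       cur[i + r * sz] + cur[i - r * sz]);
                next[i] = 2.0f * cur[i] - prev[i] + vel2dt2[i] * lap;
            }
}
```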

In addition to the discussion of performance-improvement techniques, the most important takeaway from this chapter is the three-step methodology for ensuring high efficiency:

  1. Estimate the best achievable performance, even before starting to tune
  2. Tune the code for parallelism, data locality, and vectorization
  3. Auto-tune the best set of parameters for both the build-time and run-time phases

Starting from the most basic implementation of the 3DFD algorithm, the authors describe a methodology to estimate the best performance an algorithm can achieve based on characterization of both the algorithm and the hardware.

The roofline model for Iso3DFD on dual-socket Ivy Bridge. (Courtesy Morgan Kaufmann)

The roofline model of Iso3DFD for the Xeon Phi 7120P coprocessor. (Courtesy Morgan Kaufmann)

The rooflines shown in these images represent the upper theoretical and the achievable limits of each platform, respectively. Horizontal lines represent the maximum achievable peaks when considering the (#ADD, #MUL) imbalance and when weighted by the Stream triad bandwidth. The vertical line represents the arithmetic intensity of the Iso3DFD kernel; its intersection with the other lines gives the corresponding achievable limits.
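As a back-of-the-envelope illustration of how such a roofline bound can be computed, the C sketch below takes the minimum of the compute peak and the product of arithmetic intensity and Stream triad bandwidth. The numeric values are placeholders, not the chapter's measurements:

```c
#include <stdio.h>

/* Roofline bound: attainable GFLOP/s = min(peak_gflops, AI * bandwidth). */
static double roofline_gflops(double peak_gflops, double stream_gbs, double ai)
{
    double memory_bound = ai * stream_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    /* Placeholder platform numbers -- substitute measured values. */
    double peak_gflops = 1000.0;  /* assumed compute peak, GFLOP/s */
    double stream_gbs  = 100.0;   /* assumed Stream triad bandwidth, GB/s */
    double ai          = 0.5;     /* assumed kernel arithmetic intensity, FLOP/byte */

    printf("Achievable bound: %.1f GFLOP/s\n",
           roofline_gflops(peak_gflops, stream_gbs, ai));
    return 0;
}
```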

To obtain performance close to this expected value, the authors describe a series of tuning steps that range from a basic implementation to one that uses hardware intrinsic functions.

The tuning techniques described include (see the cache-blocking sketch after this list):

  • Scalable parallelization (collaborative thread blocking)
  • Maximizing memory bandwidth (cache blocking, register reuse)
  • Maximizing in-core performance (vectorization, loop redistribution)
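To give a flavor of the cache-blocking technique, here is a hedged sketch of how the loop nest from the basic kernel above might be tiled. It reuses the definitions from that earlier sketch, and the CB_X/CB_Y/CB_Z block sizes are illustrative parameters rather than values from the chapter:

```c
#define CB_X 64   /* cache-block sizes: illustrative defaults, normally auto-tuned */
#define CB_Y 8
#define CB_Z 8

/* Same update as iso3dfd_step(), but swept tile-by-tile so that each tile's
 * working set can stay resident in cache (loop blocking / tiling). */
void iso3dfd_step_blocked(float *next, const float *cur, const float *prev,
                          const float *vel2dt2, const float *coeff,
                          size_t nx, size_t ny, size_t nz)
{
    const size_t sx = 1, sy = nx, sz = nx * ny;

    for (size_t bz = HALF_LENGTH; bz < nz - HALF_LENGTH; bz += CB_Z)
      for (size_t by = HALF_LENGTH; by < ny - HALF_LENGTH; by += CB_Y)
        for (size_t bx = HALF_LENGTH; bx < nx - HALF_LENGTH; bx += CB_X) {
          /* Clamp each tile to the interior of the domain. */
          size_t zmax = bz + CB_Z < nz - HALF_LENGTH ? bz + CB_Z : nz - HALF_LENGTH;
          size_t ymax = by + CB_Y < ny - HALF_LENGTH ? by + CB_Y : ny - HALF_LENGTH;
          size_t xmax = bx + CB_X < nx - HALF_LENGTH ? bx + CB_X : nx - HALF_LENGTH;
          for (size_t z = bz; z < zmax; ++z)
            for (size_t y = by; y < ymax; ++y)
              for (size_t x = bx; x < xmax; ++x) {
                size_t i = z * sz + y * sy + x * sx;
                float lap = coeff[0] * cur[i];
                for (int r = 1; r <= HALF_LENGTH; ++r)
                    lap += coeff[r] * (cur[i + r * sx] + cur[i - r * sx] +
                                       cur[i + r * sy] + cur[i - r * sy] +
                                       cur[i + r * sz] + cur[i - r * sz]);
                next[i] = 2.0f * cur[i] - prev[i] + vel2dt2[i] * lap;
              }
        }
}
```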

An automatic tuning method is used to find the optimal set of parameters, which may be required at application build time and/or execution time. These parameters usually come from source-code changes (e.g., loop-blocking values), compiler-driven options (e.g., the loop-unrolling factor), and hardware characteristics (e.g., cache sizes).
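One common way to expose such build-time parameters to an auto-tuner (assumed here for illustration, not taken from the chapter) is through preprocessor macros that the tuner overrides on the compile line:

```c
/* Default tuning parameters; an auto-tuner can override them at build time,
 * e.g.  cc -O3 -DCB_X=32 -DCB_Y=4 -DUNROLL=4 iso3dfd.c  */
#ifndef CB_X
#define CB_X 64      /* cache-block size in x (illustrative default) */
#endif
#ifndef CB_Y
#define CB_Y 8       /* cache-block size in y */
#endif
#ifndef UNROLL
#define UNROLL 2     /* inner-loop unrolling factor */
#endif
```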

The performance of each version on the 2S-E5 Ivy Bridge system and the coprocessor. The most optimized version, dev09, is further improved by genetic-algorithm auto-tuning. (Courtesy Morgan Kaufmann)

This chapter also implements a genetic algorithm to search the space of available parameters, including cache-blocking sizes, domain-decomposition shapes, prefetching flags, and power consumption. The resulting tuning method is considerably faster than a traditional exhaustive search. In addition to improving performance, the automatic tuning methodology selects an optimal parameter set for any input workload.
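The chapter's genetic algorithm is beyond the scope of a short sketch, but the skeleton below shows the general shape of such an auto-tuner: generate candidate parameter sets, benchmark each one, and keep the best. The candidate values and the benchmark() cost model are placeholders standing in for an actual build-and-run measurement, and the exhaustive sweep is exactly what the chapter's genetic algorithm is designed to avoid on large parameter spaces:

```c
#include <stdio.h>

/* Candidate tuning parameters for one run of the kernel. */
typedef struct { int cb_x, cb_y, cb_z; } params_t;

/* Placeholder cost model standing in for building and timing the kernel
 * with these parameters; returns a made-up throughput in GFLOP/s. */
static double benchmark(params_t p)
{
    double footprint = (double)p.cb_x * p.cb_y * p.cb_z;
    double target = 4096.0;   /* arbitrary "ideal" tile footprint */
    double miss = footprint > target ? footprint / target : target / footprint;
    return 100.0 / miss;
}

int main(void)
{
    const int cb_x_vals[] = {16, 32, 64, 128};
    const int cb_y_vals[] = {2, 4, 8, 16};
    const int cb_z_vals[] = {2, 4, 8, 16};

    params_t best = {0, 0, 0};
    double best_gflops = 0.0;

    /* Exhaustive sweep over a small space; a genetic algorithm replaces this
     * loop nest when the parameter space is too large to enumerate. */
    for (size_t i = 0; i < sizeof cb_x_vals / sizeof *cb_x_vals; ++i)
        for (size_t j = 0; j < sizeof cb_y_vals / sizeof *cb_y_vals; ++j)
            for (size_t k = 0; k < sizeof cb_z_vals / sizeof *cb_z_vals; ++k) {
                params_t p = {cb_x_vals[i], cb_y_vals[j], cb_z_vals[k]};
                double gflops = benchmark(p);
                if (gflops > best_gflops) { best_gflops = gflops; best = p; }
            }

    printf("best: CB_X=%d CB_Y=%d CB_Z=%d -> %.1f GFLOP/s\n",
           best.cb_x, best.cb_y, best.cb_z, best_gflops);
    return 0;
}
```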

Chapter Authors

Cédric Andreolli

Cédric is an application engineer in the Energy team at Intel Corporation. He helps optimize applications running on Intel platforms for the Oil and Gas industry.

Leonardo Borges

Leo is a Senior Staff Engineer who has been engaged with the Intel Many Integrated Core program from its early days. He specializes in HPC, applying his background in numerical analysis and in developing parallel numerical math libraries. Leo is focused on optimization work related to the Oil & Gas industry.

Philippe Thierry

Philippe leads the Intel Energy Engineering Team, supporting end users and software vendors in the energy sector. His work includes profiling and tuning of HPC applications for current and future platforms, as well as supercomputer definition with respect to application behavior. His research is devoted to performance extrapolation and to application characterization and modeling toward exascale computing.

Gregg Skinner

Gregg specializes in porting and optimizing science and engineering applications on parallel computers. He joined the Intel Corporation Software and Services Group in 2011.

Chuck Yount

Chuck is a Principal Engineer in the Software and Services Group at Intel Corporation, where he has been employed since 1995. He has professionally contributed in the areas of computer performance analysis and optimization (pre-Si and post-Si), object-oriented software design, machine learning, and computer architecture.

Click to see the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.


Filed Under: Featured article, Featured news, News, Xeon Phi Tagged With: HPC, Intel, Intel Xeon Phi, x86
