TechEnablement

Education, Planning, Analysis, Code

Plesiochronous (Loosely Synchronous) Phasing Barriers To Avoid Thread Inefficiencies

October 9, 2014 by Rob Farber Leave a Comment

Jim Dempsey bests expert Intel programmers by 40% – 50% simply by using a little ingenuity along with a slightly different programming technique. He notes that “a substantial portion of previously lost thread barrier wait time” can be recovered simply by using loosely synchronous (plesiochronous) barriers instead of strictly synchronous barriers. Jim points out that “those [Intel] programmers are likely much better than I am at program optimization, I merely saw an opportunity they missed.” You too have the opportunity to increase application performance with plesiochronous phasing barriers. Nor is the technique just for Intel Xeon Phi: Jim notes that the optimizations in his High Performance Parallelism Pearls chapter “are equally applicable to programming processors”.

Xeon Phi 5110P results (courtesy Morgan Kaufmann)

Xeon results (courtesy Morgan Kaufmann)

The numbers in the preceding graphs, “diffusion code speedups”, represent the ratio of the identified program's results to the single-threaded ‘base’ program's results. All results are averaged to remove timing artifacts. Jim also notes that the figure is not a scaling chart where the number of cores or threads changes, but a comparison of the performance benefits as the implementation is tuned to take advantage of the full computational capability of the coprocessor.

The optimizations used to take full advantage of the coprocessor are discussed in detail in his chapter and include:

  • base: single thread version of the program
  • omp: simplified conversion to parallel program
  • ompvect: adds simd vectorization directives
  • peel: removes unneeded code from the inner loop
  • tiled: partitions work to improve cache hit ratios

Author

Jim Dempsey

Mr. Dempsey began programming in 1967–1968 on a Digital Equipment Corporation PDP8/L (4K words, 10 cps paper tape for storage). He worked at DEC from 1972–1974 in support of operating systems (OS/8, COS300, RT11), joined Educomp Corp. in 1974, and wrote the ETOS operating system for the PDP8-E. He formed his first corporation, Network-Systems Design, Inc., in 1977 and wrote the OMNI-8 operating system (an 8-way clustered and networked O/S). Between then and now he has formed several privately owned companies (Fox Valley Data Services Inc., TapeDisk Corporation, eNoMonie Inc., and QuickThread Programming, LLC) in the software development and services area, in addition to providing consulting services. He now serves as a consultant specializing in high performance computing and embedded systems. Position: President of QuickThread Programming, LLC. Extensive programming experience in operating systems, device drivers, utilities, and compute-intensive applications. Strong skills with assembler, C/C++, and Fortran; some experience with C# and Java. Comfortable working on SMP systems running Windows or Linux. Highly efficient at optimization on Xeon and Xeon Phi processors. Available for consulting: jim@quickthreadprogramming.com

Click to see the overview article “Teaching The World About Intel Xeon Phi” that contains a list of TechEnablement links about why each chapter is considered a “Parallelism Pearl” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.
