Portable Performance with OpenCL On Intel Xeon Phi

This High Performance Parallelism Pearl shows the potential for using the OpenCL™ standard parallel programming language to deliver portable performance on Intel Xeon Phi coprocessors, Xeon processors, and many-core devices such as GPUs from multiple vendors. This portable performance can be delivered from a single program without needing multiple versions of the code, an advantage of OpenCL over most other approaches available today. As proof of OpenCL’s ability to deliver performance portability, we describe results from the BUDE molecular docking code, which sustains over 30% of peak floating-point performance on a wide variety of processors, including laptop CPUs, Xeon, Xeon Phi and GPUs. The authors also briefly discuss the relationship between OpenCL and NVIDIA’s CUDA, as well as pragma-based programming models such as OpenMP 4.0 and OpenACC.
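For readers unfamiliar with the "percent of peak" metric, the following is a minimal sketch of how such a figure is typically computed. The device parameters and the sustained rate are hypothetical placeholders chosen for illustration; they are not BUDE measurements or the specification of any real processor, and the function names are assumptions of this sketch.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle_per_core):
    """Theoretical peak in GFLOP/s: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle_per_core


def fraction_of_peak(sustained_gflops, peak):
    """Sustained rate over a complete run, divided by theoretical peak."""
    return sustained_gflops / peak


# Hypothetical device: 60 cores at 1.0 GHz, 32 single-precision FLOPs/cycle/core.
peak = peak_gflops(cores=60, clock_ghz=1.0, flops_per_cycle_per_core=32)

# Hypothetical sustained rate measured across a full application run.
sustained = 576.0

fraction = fraction_of_peak(sustained, peak)
print(f"peak = {peak} GFLOP/s, sustained fraction = {fraction:.3f}")
```

The same ratio can be computed for each device that runs the identical source, which is what makes it a convenient yardstick for performance portability.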

The chapter authors present a case study for BUDE (Bristol University Docking Engine), a molecular dynamics-based code that was ported – in its entirety – to OpenCL with the deliberate aim of delivering performance portability across a wide range of CPUs, GPUs and accelerators. This deliberate policy means that only a single source code needs to be developed and maintained, but it relies on achieving good performance for the OpenCL code on the default CPU target devices, as well as other devices including the Intel Xeon Phi coprocessor and GPUs.

Dr. Richard Sessions and a team from Bristol University have been developing BUDE for many years. BUDE employs a novel atom-atom based empirical free energy force field to accurately predict the relative binding free energies of interactions between two molecules. This ability means BUDE can be used to address three different problems: 1) virtual-screening-by-docking of millions of small molecules against a protein target (Figure 1-7); 2) binding-site detection by scanning the surface of a protein with a ligand (Figure 1-8); 3) protein-protein docking in real space by the systematic scanning of one protein surface against the other.

BUDE’s sustained performance running identical OpenCL source code across a wide range of many-core and multi-core devices. Performance is measured across a complete application run. (Courtesy Morgan Kaufmann)

Chapter Authors

Simon McIntosh-Smith

Simon McIntosh-Smith leads the HPC research group at the University of Bristol in the UK. His background is in microprocessor architecture, with a 15-year career in industry at companies including Inmos, STMicroelectronics, Pixelfusion and ClearSpeed. McIntosh-Smith co-founded ClearSpeed in 2002 where, as Director of Architecture and Applications, he co-developed the first modern many-core HPC accelerators. In 2003 he led the development of the first accelerated BLAS/LAPACK and FFT libraries, leading to the creation of the first modern accelerated Top500 system, TSUBAME-1.0 at Tokyo Tech in 2006. He joined the University of Bristol in 2009, where his research focuses on many-core algorithms and performance portability, and on fault-tolerant software techniques to reach Exascale. He is a joint recipient of an R&D 100 award for his contribution to Sandia’s Mantevo benchmark suite, and in 2014 he was awarded the first Intel Parallel Computing Center in the UK. McIntosh-Smith actively contributes to the Khronos OpenCL heterogeneous many-core programming standard.

Tim Mattson

Tim Mattson is a principal engineer in Intel’s Microprocessor and Programming Research laboratory. He is an old-fashioned application programmer with experience in quantum chemistry, seismic signal processing, and molecular modeling, and has used more parallel programming models than he can keep track of. Tim was part of the teams that created OpenMP and OpenCL. Most recently, he has been working on the memory and execution models for the next major revision of OpenCL (OpenCL 2.0). Tim has published extensively, including the books Patterns for Parallel Programming (with B. Sanders and B. Massingill, Addison Wesley, 2004), An Introduction to Concurrency in Programming Languages (with M. Sottile and C. Rasmussen, CRC Press, 2009), and the OpenCL Programming Guide (with A. Munshi, B. Gaster, J. Fung, and D. Ginsburg, Addison Wesley, 2011).

See the overview article “Teaching The World About Intel Xeon Phi”, which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl”, plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.
