The Unabridged Chapter 1 Introduction To High Performance Parallelism Pearls

October 3, 2014 by Rob Farber

Following is the full, unabridged text of the chapter 1 introduction (written by James Reinders) to High Performance Parallelism Pearls. Thanks to Morgan Kaufmann, James Reinders, and Jim Jeffers for giving permission so TechEnablement can make this available. After reading what James wrote, you will see that summarizing the introduction would simply have left out too much of the information you will likely want to know about the contents of High Performance Parallelism Pearls.

Click to see the overview article “Teaching The World About Intel Xeon Phi,” which contains a list of TechEnablement links explaining why each chapter is considered a “Parallelism Pearl,” plus information about James Reinders and Jim Jeffers, the editors of High Performance Parallelism Pearls.

The following is reprinted courtesy of Morgan Kaufmann from “High Performance Parallelism Pearls”

CHAPTER 1: INTRODUCTION

by James Reinders, Intel Corporation

“We should create a cookbook” was a frequent comment that Jim Jeffers and I heard after Intel® Xeon Phi™ Coprocessor High-Performance Programming was published. Guillaume Colin de Verdière was early in his encouragement to create such a book and was pleased when we moved forward with this project. Guillaume matched action with words by also coauthoring the first contributed chapter with Jason Sewall, From “correct” to “correct & efficient”: a Hydro2D case study with Godunov’s scheme. Their chapter reflects a basic premise of this book: that sharing experience and success can be highly educational to others. It also contains a theme familiar to those who program the massive parallelism of the Intel Xeon Phi family: running code on Intel Xeon Phi coprocessors is easy, which lets you quickly focus on optimization and the achievement of high performance—but we do need to tune for parallelism in our applications! Notably, such optimization work improves performance on both processors and coprocessors. As the authors note, “a rising tide lifts all boats.”

LEARNING FROM SUCCESSFUL EXPERIENCES

Learning from others is what this book is all about. It brings together the collective work of numerous experts in parallel programming to share their experience. The examples were selected for their educational content, applicability, and success, and you can download the code and try it yourself! All the examples demonstrate successful approaches to parallel programming, but not all of them scale well enough to make an Intel Xeon Phi coprocessor run faster than a processor. This is what we face in the real world, and it reinforces something we are not bashful in pointing out: a common programming model matters a great deal. You will see that notion emerge over and over in real-life examples, including those in this book.

We are indebted to the many contributors to this book, in which you will find a rich set of examples and advice. Given that this is the introduction, we offer a little perspective to bind the collection together. Most of all, we encourage you to dive into the rich examples, which begin in Chapter 2.

CODE MODERNIZATION

It is popular to talk about “code modernization” these days. Having experienced the “inspired by 61 cores” phenomenon, we are excited to see it has gone viral and is now being discussed by more and more people. You will find lots of “modernization” shown in this book.

Code modernization is reorganizing the code, and perhaps changing algorithms, to increase the amount of thread parallelism, vector/SIMD operations, and compute intensity to optimize performance on modern architectures. Thread parallelism, vector/SIMD operations, and an emphasis on temporal data reuse are all critical for high-performance programming. Many existing applications were written before these elements were required for performance, and therefore, such codes are not yet optimized for modern computers.
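
As a concrete illustration of what such modernization can look like, here is a minimal sketch using OpenMP, the model most chapters in this book rely on. The function names and compile line are ours, not the book’s:

    #include <stddef.h>

    /* Before modernization: a straightforward serial loop. */
    void saxpy_serial(const float *x, float *y, float a, size_t n) {
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    /* After: the same loop annotated for thread parallelism and
     * vector/SIMD execution. Compile with, e.g., cc -O2 -fopenmp. */
    void saxpy_modern(const float *restrict x, float *restrict y,
                      float a, size_t n) {
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }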

MODERNIZE WITH CONCURRENT ALGORITHMS

Examples of opportunities to rethink approaches to better suit the parallelism of modern computers are scattered throughout this book. Chapter 5 encourages using barriers with an eye toward more concurrency. Chapter 11 stresses the importance of not statically decomposing workloads because neither workloads nor the machines we run them on are truly uniform. Chapter 18 shows the power of not thinking that the parallel world is flat. Chapter 26 juggles data, computation, and storage to increase performance. Chapter 12 increases performance by ensuring parallelism in a heterogeneous node. Enhancing parallelism across a heterogeneous cluster is illustrated in Chapter 13 and Chapter 25.
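
To make the point about static decomposition concrete: one common way to avoid it is to let the runtime hand out work on demand. The following is a minimal OpenMP sketch under our own assumptions (process_item is a hypothetical function whose cost varies per item); it is not code from any chapter:

    #include <stddef.h>

    /* Hypothetical per-item work whose cost varies from item to item. */
    extern double process_item(size_t i);

    double process_all(size_t n) {
        double sum = 0.0;
        /* schedule(dynamic) hands out chunks of iterations at run time,
         * so faster threads pick up extra work instead of idling behind
         * a static, uniform split. */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
        for (size_t i = 0; i < n; ++i)
            sum += process_item(i);
        return sum;
    }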

MODERNIZE WITH VECTORIZATION AND DATA LOCALITY

Chapter 8 provides a solid examination of data layout issues in the quest to process data as vectors. Chapters 27 and 28 provide additional education and motivation for doing data layout and vectorization work.
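
The recurring data-layout lesson in those chapters is the trade-off between array-of-structures (AoS) and structure-of-arrays (SoA) layouts. A minimal sketch of the idea, with names of our own choosing:

    #include <stddef.h>

    /* Array-of-structures (AoS): x, y, z for one point are adjacent,
     * so loading all the x values into a vector requires strided
     * (gather-like) access. */
    struct point_aos { float x, y, z; };

    /* Structure-of-arrays (SoA): each coordinate is contiguous, so
     * the loop below maps cleanly onto vector loads and stores. */
    struct points_soa {
        float *x, *y, *z;
        size_t n;
    };

    void translate_x(struct points_soa *p, float dx) {
        #pragma omp simd
        for (size_t i = 0; i < p->n; ++i)
            p->x[i] += dx;   /* unit-stride access vectorizes well */
    }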

UNDERSTANDING POWER USAGE

Power usage is mentioned in enough chapters that we invited Intel’s power tuning expert, Claude Wright, to write Chapter 14. His chapter looks directly at methods to measure power, including creating a simple software-based power analyzer with the Intel MPSS tools, as well as the difficulty of measuring idle power: you are not idle if you are busy measuring power!

ISPC AND OPENCL ANYONE?

While OpenMP and TBB dominate as parallel programming solutions in the industry and in this book, we have included some mind-stretching chapters that make the case for other solutions. SPMD programming offers interesting solutions for vectorization, including help with data layout, at the cost of dropping sequential consistency. Is that okay? Chapters 6 and 21 include usage of ispc and its SPMD approach for your consideration. SPMD thinking resonates well when you approach vectorization, even if you do not adopt ispc.
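
For readers who have not met SPMD before, the core idea is to write the per-element logic as ordinary scalar code and let the compiler run one logical program instance per SIMD lane. ispc does this natively; the following rough C analogue using OpenMP’s declare simd is our own sketch, not code from Chapters 6 or 21:

    #include <math.h>

    /* Scalar "per program instance" logic; with declare simd the
     * compiler also emits a vectorized version that runs one logical
     * instance of this function per SIMD lane. */
    #pragma omp declare simd
    float rsqrt_lane(float v) {
        return 1.0f / sqrtf(v);
    }

    void rsqrt_all(const float *in, float *out, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            out[i] = rsqrt_lane(in[i]);
    }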

Chapter 22 is written to advocate for OpenCL usage in a heterogeneous world. The contributors describe results from the BUDE molecular docking code, which sustains over 30% of peak floating-point performance on a wide variety of systems.

INTEL XEON PHI COPROCESSOR SPECIFIC

While most of the chapters move algorithms forward on processors and coprocessors, three chapters are dedicated to a deeper look at Intel Xeon Phi coprocessor specific topics. Chapter 15 presents current best practices for managing Intel Xeon Phi coprocessors in a cluster. Chapters 16 and 20 give valuable insights for users of Intel Xeon Phi coprocessors.

MANY-CORE, NEO-HETEROGENEOUS

The adoption rate of Intel Xeon Phi coprocessors has been steadily increasing since they were first introduced in November 2012. By mid-2013, the cumulative number of FLOPs contributed by Intel Xeon Phi coprocessors in TOP 500 machines exceeded the combined FLOPs contributed by all the graphics processing units (GPUs) installed as floating-point accelerators in the TOP 500 list. In fact, the only device type contributing more FLOPs to TOP 500 supercomputers was Intel Xeon® processors.

As we mentioned in the Preface, the 61 cores of an Intel Xeon Phi coprocessor have inspired a new era of interest in parallel programming. As we saw in our introductory book, Intel Xeon Phi Coprocessor High-Performance Programming, the coprocessors use the same programming languages, parallel programming models, and tools as processors. In essence, this means that the challenge of programming the coprocessor is largely the same as the challenge of parallel programming for a general-purpose processor. This is because the designs of both processors and the Intel Xeon Phi coprocessor avoided the restrictive programming nature inherent in heterogeneous programming with devices of limited programmability.

The experiences of programmers using the Intel Xeon Phi coprocessor have, time and time again, reinforced the value of a common programming model—a fact that is independently and repeatedly emphasized by the chapter authors in this book. The take-away message is clear: the effort spent tuning for scaling and vectorization on the Intel Xeon Phi coprocessor is time well spent improving performance on processors such as Intel Xeon processors.

NO “XEON PHI” IN THE TITLE, NEO-HETEROGENEOUS PROGRAMMING

Because the key programming challenges are generically parallel, we knew we needed to emphasize the applicability to both multicore and many-core computing instead of focusing only on Intel Xeon Phi coprocessors, which is why “Xeon Phi” does not appear in the title of this book.

However, systems that combine coprocessors and processors do usher in two unique challenges, both addressed in this book: (1) hiding the latency of moving data to and from an attached device, a challenge common to any “attached” device, including GPUs and coprocessors (future Intel Xeon Phi products will offer configurations that eliminate this data-movement challenge by being offered as processors rather than packaged coprocessors); and (2) the broader challenge of programming heterogeneous systems.

Previously, heterogeneous programming referred to systems that combined incompatible computational devices: devices whose programming methods were different enough to require separate development tools and coding approaches. The Intel Xeon Phi products changed all that. Intel Xeon Phi coprocessors offer coding methods for parallel programming that are compatible with those used by all processors. Intel customers have taken to calling this “neo-heterogeneity” to stress that the sought-after value of heterogeneous systems can finally be obtained without the programming being heterogeneous. This gives us the highly desirable combination of homogeneous programming on neo-heterogeneous hardware (i.e., use of a common programming model across the compute elements, specifically the processors and coprocessors).

THE FUTURE OF MANY-CORE

Intel has announced that it is working on multiple future generations of Intel Xeon Phi devices and has released information about the second-generation product, code-named Knights Landing.

The many continued dividends of Moore’s Law are evident in the features of Knights Landing. The biggest change is the opportunity to use Knights Landing as a processor. The advantages of being available as a processor are numerous, and include more power-efficient systems, reduced data movement, a large standard memory footprint, and AVX-512 (a processor-compatible vector capability nearly identical to the vector capabilities found in the first-generation Intel Xeon Phi coprocessor).
Additional dividends of Moore’s Law include use of a very modern out-of-order low-power core design, support for on-package high-bandwidth memory, and versions that integrate fabric support (inclusion of a Network Interface Controller).

The best way to prepare for Knights Landing is to be truly “inspired by 61 cores.” Tuning for today’s Intel Xeon Phi coprocessor is the best way to be sure an application is on track to make good use of Knights Landing. Of course, since Knights Landing promises to be even more versatile, it is possible that using a processor-based system with more than 50 cores is a better choice to get your application ready. We say more than 50 because we know from experience that a small number of cores does not easily inspire the level of tuning needed to be ready to utilize high levels of parallelism.

We can logically conclude that the future for many-core is bright. Neo-heterogeneous programming is already enabling code modernization for parallelism, while trending to get easier in each generation of Intel Xeon Phi devices. We hope the “recipes” in this book provide guidance and motivation for modernizing your applications to take advantage of highly parallel computers.

DOWNLOADS

During the creation of this book, we specifically asked chapter authors to focus on important code segments that highlight the key concepts in their chapters. In addition, we required that chapters include complete examples, which are available for download from our Web site http://lotsofcores.com or from project Web sites. The “For more information” section at the end of each chapter will steer you to the right place to find the code.

Instructors, and all who create presentations, can appreciate the value of being able to download all the figures, diagrams, and photos used in this book from http://lotsofcores.com. Please reuse them to help teach and explain parallel programming to more software developers! We appreciate being mentioned when you attribute the source of figures, but otherwise we will place no onerous conditions on their reuse.

FOR MORE INFORMATION

Some additional reading worth considering includes:

  • Downloads associated with this book: http://lotsofcores.com.
  • Web site sponsored by Intel dedicated to parallel programming: http://go-parallel.com.
  • Intel Web site for information on the Intel Xeon Phi products: http://intel.com/xeonphi.
  • Intel Web site for information on programming the Intel Xeon Phi products: http://intel.com/software/mic.
  • Intel Xeon Phi Users Group Web site: https://www.ixpug.org.
  • Advanced Vector Instructions information: http://software.intel.com/avx.
