First printed December 10, 2012 in Dr. Dobb's.
Developers can reach a teraflop/s of number-crunching power via one of several routes:
- Using pragmas to augment existing codes so they offload work from the host processor to the Intel Xeon Phi coprocessor(s)
- Recompiling source code to run directly on the coprocessor as a separate many-core Linux SMP compute node
- Accessing the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library)
- Using each coprocessor as a node in an MPI cluster or, alternatively, as a device containing a cluster of MPI nodes
From this list, experienced programmers will recognize that the Phi coprocessors support the full gamut of modern and legacy programming models. Most developers will quickly find that they can program the Phi in much the same manner that they program existing x86 systems. The challenge lies in expressing sufficient parallelism and vector capability to achieve high floating-point performance, as the Intel Xeon Phi coprocessors provide more than an order-of-magnitude increase in core count over the current generation of quad-core processors. Keeping that many cores busy, each with fully populated vector units, is the path to realizing that performance.
The focus of this first article is to get up and running on Intel Xeon Phi as quickly as possible. Complete working examples will show that only a single offload pragma is required to adapt an OpenMP square-matrix multiplication example to run on a Phi coprocessor. Performance comparisons demonstrate that both the pragma-based offload model and the use of Intel Xeon Phi as an SMP processor compare favorably against the MKL optimized for the host, and that the MKL optimized for the Phi can easily deliver over a teraflop/s.
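To make the "single offload pragma" idea concrete, here is a minimal sketch of an OpenMP square-matrix multiplication with one Intel offload pragma in front of it. It is an illustration under stated assumptions, not the article's benchmark code: the function names `matmul_offload` and `matmul_gflops`, the flat row-major layout, and the loop ordering are all choices made for this sketch. The `#pragma offload` line assumes the Intel compiler's offload extension. A compiler that does not recognize it (or the OpenMP pragmas) should simply ignore those lines, so the same source still runs on the host.

```c
#include <assert.h>

/*
 * Hypothetical example: multiply two n x n row-major matrices, c = a * b.
 * Names and data layout are assumptions made for this sketch.
 */
void matmul_offload(int n, float *a, float *b, float *c)
{
    assert(n > 0 && a && b && c);

    /* The only offload-specific line: copy a and b to the coprocessor,
     * run the block there, and copy c back. Assumes the Intel compiler's
     * offload extension; other compilers ignore the unknown pragma and
     * execute the block on the host. */
#pragma offload target(mic) in(a : length(n * n)) in(b : length(n * n)) out(c : length(n * n))
    {
        /* Rows of c are independent, so they are split across threads. */
#pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j)
                c[i * n + j] = 0.0f;

            /* i-k-j ordering keeps the innermost loop unit-stride over
             * b and c, which gives the compiler a chance to vectorize. */
            for (int k = 0; k < n; ++k) {
                const float aik = a[i * n + k];
                for (int j = 0; j < n; ++j)
                    c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}

/* A square matrix multiply performs about 2*n^3 floating-point
 * operations (one multiply and one add per inner iteration), so the
 * achieved rate in GFLOP/s follows from the measured run time. */
double matmul_gflops(int n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}
```

Timing a call to `matmul_offload` and passing the elapsed seconds to `matmul_gflops` gives a figure that can be compared across the host and coprocessor builds. A naive triple loop like this one is unlikely to match a tuned library routine, so treat it as a starting point for measurement rather than a performance claim.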