In this article we focus on Intel® MPI and how it supports scalable distributed machine learning both now and in the future. This is the fourth in a multi-part series on Faster Deep Learning with the Intel Scalable System Framework (Intel® SSF).
MPI stands for Message Passing Interface. The Intel® MPI library provides programmers with a “drop-in” MPICH replacement library that can deliver the performance benefits of the Intel® Omni-Path Architecture (Intel® OPA) communications fabric plus high core count Intel® Xeon and Intel® Xeon Phi™ processors. Tests have verified the scalability of the Intel MPI implementation to 340,000 MPI ranks, where a rank is a separate MPI process that can run on a single core or an individual system. Other communications fabrics such as InfiniBand are also supported, and programmers can recompile their applications to use the Intel MPI library.
MPI is a key communications layer for many scientific and commercial applications including machine and deep learning applications. In general, all distributed communications pass through the MPI API (Application Programming Interface), which means compliance and performance at scale are both critical.
This article focuses on key aspects of the Intel MPI library that affect machine and deep learning, including: (1) Intel-specific tuning and MPICH compatibility, (2) hybrid MPI programming support, and (3) the scalability and performance of the global broadcast and reduction operations that are key to machine learning. MPI startup time is also discussed, because the MPI_Init() method first called by every MPI application can have a non-trivial runtime at tens to hundreds of thousands of MPI ranks, which also highlights the need for good MPI profiling support.
Intel specific tuning and compatibility testing
The Intel MPI team has spent a significant amount of time tuning the Intel MPI library for different processor families, network types, and topologies, including the Intel OPA fabric and Intel® Xeon and Intel® Xeon Phi™ processors.
Shared memory is particularly important on high core count processors because data can be shared between cores without a copy operation. DMA-mapped memory and RDMA (Remote Direct Memory Access) operations are also utilized to prevent excess data movement. Optimized memory copy operations, tuned to the processor microarchitecture, are used only when required, as is the case with very small messages. Special support is also provided for the near and far memory of the latest Intel Xeon Phi processor: near memory is close to the processor and resides in fast MCDRAM, while far memory is further from the processor and resides in conventional DDR4 memory.
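The zero-copy idea behind intra-node shared memory can be sketched in a few lines. This is not Intel MPI's implementation (real MPI ranks are separate processes, and the library uses OS shared-memory segments); the sketch below uses Python threads, which share one address space, purely to illustrate how a consumer can read the producer's data in place with no copy:

```python
import threading

def shared_buffer_demo():
    # One buffer visible to both "cores" (threads here): the payload is
    # written once and read in place, never copied between them.
    buf = bytearray(16)
    ready = threading.Event()

    def producer():
        buf[:5] = b"hello"   # write directly into the shared buffer
        ready.set()          # signal the consumer that data is ready

    t = threading.Thread(target=producer)
    t.start()
    ready.wait()             # consumer waits, then reads in place
    t.join()
    return bytes(buf[:5])
```

The same pattern, implemented with OS shared-memory segments between processes, is what lets an MPI library move data between ranks on one node without traversing the network stack.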
MPI is a fundamental building block, which is why Intel follows a quarterly release cycle to ensure that customers have access to the latest performance optimizations. To ensure strict compatibility, the Intel MPI team runs aggressive regression tests to verify that all configurations run correctly, run quickly, and conform to the MPICH and ABI specifications. Regression tests run nightly, and a more comprehensive test suite runs weekly.
Hybrid MPI/Multi-threaded support for high-core count processors
Intel has verified the scalability of the Intel MPI implementation to 340,000 MPI ranks. Each MPI rank is a separate process that can run on a single core or in a hybrid multi-threaded model where each MPI process uses threads to take advantage of multiple cores in a node. Machine learning and deep learning applications utilize a hybrid MPI/multi-threaded model to exploit all the vector and parallel capabilities of the hardware. An animated illustration of the hybrid model is shown below.
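The hybrid model can be simulated in miniature: each "rank" below is given a shard of the data and uses a thread pool to reduce its shard (the role of OpenMP or native threads inside a real rank), and the per-rank partial results are then combined (the role MPI collectives play across nodes). This is a toy sketch of the structure, not an MPI program:

```python
from concurrent.futures import ThreadPoolExecutor

def rank_partial_sum(shard, n_threads=4):
    # Inside one "MPI rank": threads share the rank's memory and each
    # reduces a strided chunk of the shard in parallel.
    chunks = [shard[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum, chunks))

def hybrid_sum(dataset, n_ranks=4):
    # Across "ranks": each rank reduces its own shard, then the partial
    # sums are combined, as an MPI reduction would do across nodes.
    shards = [dataset[r::n_ranks] for r in range(n_ranks)]
    return sum(rank_partial_sum(s) for s in shards)
```

The two levels map directly onto hardware: threads exploit the cores within a node, while MPI ranks span nodes across the fabric.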
Many legacy HPC MPI applications are transitioning to the hybrid model to better exploit vector floating-point capabilities. For example, utilizing the full AVX-512 floating-point capability can deliver a potential 8x speedup over non-vectorized floating-point code. Machine learning training is very floating-point intensive and parallelizes well, which means it fits well in the lower right-hand side of the following Intel performance schematic.

The Intel MPI library supports the latest MPI-3 standard and is binary compatible with existing MPI-1.x and MPI-2.x applications. This means that even legacy HPC applications can use the Intel MPI library without recompiling to run efficiently on the latest generation of Intel hardware.
Further, the Intel MPI effort is an active participant in the MPICH ABI Compatibility Initiative. The MPICH Application Binary Interface, or ABI, is the low-level interface that determines such details as how functions are called and the size, layout and alignment of datatypes. With ABI compatibility, programs conform to the same set of runtime conventions, which ensures that any MPICH-compiled application – regardless of which vendor library was used for compilation – can use the Intel MPI runtime.
Broadcast performance and scalability
Deep learning networks can contain millions of parameters that must be broadcast to all the distributed processors during each step of the training procedure. For example, the ImageNet deep convolutional neural network that is used to classify 1.2 million high-resolution images contains 60 million parameters. Very deep convolutional networks can contain 138 million parameters. Each of these parameters can be expressed with a four-byte single-precision or eight-byte double-precision number.
The following graph shows how the Intel MPI team has achieved an 18.24x improvement over OpenMPI. This added performance can help speed the training of deep and very deep image-recognition neural networks. In particular, the 524,288-byte message size is well suited to broadcasting the parameters of big convolutional neural networks.
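The arithmetic behind those broadcast sizes is easy to work out. The short sketch below (an illustration, not part of any MPI API) computes the total payload and the number of 524,288-byte messages needed to broadcast a model's parameters:

```python
def broadcast_payload(n_params, bytes_per_param=4, msg_bytes=524_288):
    """Total bytes to broadcast, and how many fixed-size messages
    that payload splits into (ceiling division)."""
    total = n_params * bytes_per_param
    n_msgs = -(-total // msg_bytes)
    return total, n_msgs

# A 60-million-parameter network in single precision is a 240 MB
# payload, i.e. hundreds of 512 KiB broadcast messages per training
# step; a 138-million-parameter network needs roughly 2.3x more.
```

Because this payload moves on every training step, per-message broadcast performance at large message sizes translates directly into time-to-model.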
Reduction performance and scalability
Reduction performance becomes ever more important as the number of computational ranks in the job increases. In particular, data scientists can decrease the amount of data per computational node to decrease the time it takes each node to perform a training step, as discussed in the second article in this series. However, the MPI_Reduce() library call that is used to perform the reduction can become the rate-limiting step as the number of computational nodes (i.e., MPI ranks) increases.
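This trade-off can be captured in a toy cost model (illustrative only; the constants are made up, not measured): compute time per step shrinks as data is spread over more nodes, while a latency-bound tree reduction grows with log2 of the node count, so past some scale the reduction dominates the step time:

```python
import math

def step_time(n_nodes, total_work=1e6, work_rate=1e6, latency=1e-3):
    # Compute time: each node processes 1/n of the data.
    compute = (total_work / n_nodes) / work_rate
    # Reduction time: a tree reduction takes ~ceil(log2(n)) rounds,
    # each paying one fabric latency (hypothetical 1 ms here).
    reduce_cost = latency * math.ceil(math.log2(n_nodes)) if n_nodes > 1 else 0.0
    return compute + reduce_cost
```

With these assumed constants, step time falls as nodes are added but eventually rises again once the log2(n) reduction term outweighs the shrinking compute term, which is exactly when MPI_Reduce() becomes the rate-limiting step.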
The Intel MPI team has tuned the reduction operations to deliver greater performance than the OpenMPI MPI_Reduce() implementation. Reductions are particularly tricky to optimize as they tend to be latency rather than bandwidth limited and utilize small messages. For example, machine learning (and many other HPC applications) tend to perform arithmetic reductions using double-precision values to preserve as much precision as possible. This means that potentially large numbers of 8-byte messages can flood the communications fabric.
Not much can be done about the speed-of-light latency limitations of the communications fabric. However, the Intel MPI library can pick among a variety of algorithms and tuned implementations depending on the network topology and processor architecture to reduce both the number of messages transmitted and the software latency of the library itself. Shared memory in particular can be used on high-core count processors to ensure that only a single value needs to be transmitted from each node.
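One classic algorithmic choice is a binomial-tree reduction, which combines n values in roughly log2(n) latency-bound rounds instead of n-1 serial receives at a single root. The sketch below simulates the tree schedule over a list of per-rank values (a generic illustration of the technique, not Intel MPI's actual implementation, which selects among several tuned algorithms):

```python
def tree_reduce(values):
    # Binomial-tree reduction: in each round, every remaining sender
    # forwards its partial sum to a partner half a stride away, so the
    # whole reduction completes in ceil(log2(n)) rounds.
    vals = list(values)
    n = len(vals)
    rounds = 0
    stride = 1
    while stride < n:
        for src in range(stride, n, 2 * stride):
            vals[src - stride] += vals[src]  # one "message" per pair
        rounds += 1
        stride *= 2
    return vals[0], rounds  # reduced value at rank 0, rounds taken
```

For a latency-bound operation, collapsing thousands of serial receives into a handful of rounds is where most of the win comes from, which is why the algorithm choice matters more than raw bandwidth for small-message reductions.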
The following graph shows a 1.34x performance improvement when using the Intel MPI library as opposed to OpenMPI. For reduction-limited applications, this translates to a significant time-to-model improvement simply by “dropping in” the Intel MPI library for MPICH-compatible binaries (or by simply recompiling to transition from non-MPICH libraries like OpenMPI).
MPI startup time
Running large scale Intel MPI applications means that close attention needs to be paid to the MPI_Init() routine. Running with many thousands of ranks means that infrastructure management operations can consume a large part of the MPI initialization time. In other words, the initialization of the MPI environment to provide all the ranks with a common, consistent environment must scale as well.
Several factors lead to increased startup time. These include extra communication over the PMI (Process Management Interface) before the fabric is available. In addition, initial global collective operations can place a high load on the fabric during the startup phase. The number of messages passed across the fabric can increase dramatically as the MPI rank count increases, causing long startup times at scale. Ensuring fast startup times is one of many reasons why Intel validated their MPI library to 340,000 MPI ranks. Succinctly, new approaches and algorithms are required for even seemingly mundane tasks like starting large numbers of MPI processes at scale.
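A back-of-the-envelope model shows why naive startup schemes break down. If every rank exchanged its endpoint information directly with every other rank, the message count would grow quadratically; a tree-shaped gather-and-broadcast through the process manager grows only linearly. The counts below are illustrative of the scaling argument, not of Intel MPI's actual wire protocol:

```python
def startup_messages(n_ranks, scheme="allgather"):
    # Naive full exchange: every rank sends its address/endpoint info
    # to every other rank -> n * (n - 1) messages.
    if scheme == "allgather":
        return n_ranks * (n_ranks - 1)
    # Tree-based gather then broadcast via the process manager:
    # roughly 2 * (n - 1) messages in total.
    return 2 * (n_ranks - 1)
```

At 340,000 ranks the naive scheme requires over a hundred billion messages while the tree scheme needs under a million, which is why scalable startup demands different algorithms, not just faster hardware.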
Profiling and tuning
Intel provides a number of MPI profiling tools such as the Intel® Trace Analyzer and Collector and the MPI Performance Snapshot tool. The latter is especially important for examining MPI behavior at scale as it lets developers understand performance when scaling out to thousands of ranks. The MPI Performance Snapshot tool combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to categorize applications including reports of MPI vs. OpenMP load imbalance, memory usage, and a break-down of MPI vs. OpenMP vs. serial time.
MPI offers programmers the ability to compute in distributed environments with high efficiency within a variety of environments from individual machines to organizational clusters and the world’s largest supercomputers. MPI also runs in the cloud and will certainly be available on exascale class supercomputers. The ability to run in most environments plus the availability of highly optimized libraries like the Intel MPI library that support efficient global broadcast and reduction operations makes MPI a natural distributed computing framework for machine- and deep-learning. Binary compatibility means that customers of the Intel MPI library will stay current with the latest optimizations and performance tuning by Intel, plus most customers will never exceed the validated scaling envelope.
For More Information
Volume 21 of the Intel magazine “The Parallel Universe” that is dedicated to MPI provides an excellent source of additional information for those interested in more details about the Intel MPI library.
Chapter 7, “Deep-Learning Numerical Optimization,” High Performance Parallelism Pearls, volume 1, Morgan Kaufmann, 2014, ISBN 9780128021187.
Very Deep Convolutional Networks for Large-Scale Image Recognition: http://arxiv.org/pdf/1409.1556.pdf