In this article we will focus on Intel® Omni-Path Architecture and the implications for distributed machine learning. This is the second in a multi-part series on Faster Deep Learning with the Intel Scalable System Framework (Intel® SSF).
Deep learning neural networks can be trained to solve complex pattern recognition tasks – sometimes even better than humans. However, training a machine learning algorithm to accurately solve complex problems requires large amounts of data, which greatly increases the computational workload. Scalable distributed parallel computing using a high-performance communications fabric is an essential part of the technology that makes the training of deep learning on large, complex datasets tractable in both the data center and the cloud. Distributed means the computation is spread across multiple nodes.
Parallel distributed computing is a necessary part of machine learning, as even the TF/s parallelism of a single Intel Xeon and Intel Xeon Phi™ workstation is simply not sufficient to train on many complex machine learning training sets in a reasonable time. Instead, numerous computational nodes must be connected together via a high-performance, low-latency communications fabric like Intel Omni-Path Architecture (referred to as Intel OPA). The data scientist can choose a sufficient number of nodes to reduce the time to solution while providing sufficient memory capacity and floating-point performance. The following scaling curve shows the 2.2 PF/s average sustained performance the TACC supercomputer was able to deliver on a machine learning problem (a Principal Components Analysis using an autoencoder).

Figure 1: Near-linear scaling to 3,000 Intel Xeon Phi nodes on the TACC Supercomputer
Tightly coupled computations are one of the reasons why HPC systems designers spend large amounts of money on the communications fabric. A tightly coupled computation in a distributed environment means that nodes frequently rely on results from other nodes, to the extent that the runtime of the computation will be significantly affected by slow nodes or network links. This effect can be seen in the slight bend in the Figure 1 scaling curve between 0 and 500 nodes, which has been attributed to additional layers of switches being incorporated into the MPI communication path. In other words, we can see the impact on the scaling curve caused by the delay the data packets experience when they have to traverse extra switches to reach their destination. Denser Intel OPA switches will reduce this effect, as the data packets will have to traverse, or hop, through fewer switches in a given MPI configuration.
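As a toy illustration of tight coupling (a sketch written for this article, not code from the TACC runs): when every training step ends in a collective operation, each rank waits at the step boundary for the slowest participant, so a slow node or an extra switch hop stretches the whole step.

```c
/* Toy MPI sketch: ranks do unequal "compute", then synchronize.
 * The time measured at the barrier shows how a tightly coupled step
 * is paced by the slowest node plus the latency of the fabric. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Simulate per-node compute time that varies between ranks. */
    usleep(100000 + 10000 * (rank % 4));

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* tightly coupled step boundary */
    double t1 = MPI_Wtime();

    printf("rank %d waited %.3f ms at the step boundary\n",
           rank, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}
```

The ranks that finish their simulated compute early report the longest waits; the same effect, at much larger scale, is what extra switch hops introduce in the Figure 1 curve.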
Machine learning is but one example of a tightly coupled distributed computation where the small message traffic generated by a distributed network reduction operation can have a big impact on application performance. Such reduction operations are common in HPC codes, which is one of the reasons why so much money is spent on the communications fabric in HPC and cloud computing, where the network can account for up to 30% of the cost of the machine. Increased scalability and performance at an economical price point explain the importance of Intel OPA to the HPC and machine learning communities as well as the cloud computing community.

Figure 2: Intel(R) OPA is designed to reduce network costs (Image courtesy Intel [1])
Complex problems require big data
Lapedes and Farber showed in their paper, “How Neural Nets Work” [2], that neural networks essentially learn to solve a problem by fitting a multi-dimensional surface to the example data. During training, the neural network uses its nonlinear functions to build and place bumps and valleys at the high and low points in the surface. Prediction works by interpolating between, or extrapolating from, points on this surface. Empirically, the more complex the problem, the bumpier the surface that must be fit, which increases the amount of data that must be presented during training to define the location and height of each feature (i.e. each bump or valley). In short: the more complex the problem, the greater the amount of data required to accurately represent the bumpy surface during training.
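As a loose illustration of the “bumps and valleys” picture (a sketch written for this article, not code from the Lapedes and Farber paper), a single hidden layer of tanh units is just a weighted sum of smooth bumps and steps; training moves the weights so the resulting surface passes near the example data. The parameter values below are hand-picked for illustration, not trained:

```c
/* Hypothetical sketch: evaluate a one-hidden-layer network
 *     y(x) = sum_i v[i] * tanh(w[i]*x + b[i])
 * Each tanh term contributes one of the "bumps" the article describes;
 * training would adjust w, b and v so the surface fits the examples. */
#include <math.h>
#include <stdio.h>

#define HIDDEN 4

static double predict(double x, const double w[], const double b[], const double v[]) {
    double y = 0.0;
    for (int i = 0; i < HIDDEN; ++i)
        y += v[i] * tanh(w[i] * x + b[i]);   /* one bump/step per hidden unit */
    return y;
}

int main(void) {
    /* Illustrative, hand-picked parameters (not trained values). */
    const double w[HIDDEN] = {  4.0,  4.0, -3.0,  2.0 };
    const double b[HIDDEN] = { -1.0,  2.0,  0.5, -3.0 };
    const double v[HIDDEN] = {  0.8, -0.6,  0.5,  0.7 };

    for (double x = -2.0; x <= 2.0; x += 0.5)
        printf("x = %+.1f  y = %+.4f\n", x, predict(x, w, b, v));
    return 0;
}
```

A bumpier target surface needs more hidden units and, above all, more training data to pin each bump down, which is exactly why complex problems demand big data.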
Happily, massively parallel distributed mappings that exhibit near-linear scalability, such as the exascale-capable mapping by Farber (described in the previous article), give data scientists the ability to exploit high-performance interconnects such as Intel OPA and use as many computational nodes as necessary to train on very large and complex datasets.
Joe Yaworski (Intel Director of Fabric Marketing for the HPC Group) notes that, “The Intel Omni-Path Architecture delivers a mixture of hardware and software enhancements that will optimize machine learning performance, as it’s specifically designed for low latency and high message rates that scales as the cluster size increases.” Benchmark results presented later in this article validate Yaworski’s statement.
The Intel Omni-Path Architecture delivers a mixture of hardware and software enhancements that will optimize machine learning performance, as it’s specifically designed for low latency and high message rates that scales as the cluster size increases – Joe Yaworski (Intel Director of Fabric Marketing for the HPC Group)
Tuning machine learning for the fastest time to solution in a distributed environment
A data scientist can tune a training run to be as fast as possible in a distributed environment simply by using more nodes, thus reducing the number of examples per distributed computational node for a fixed training set.
Briefly, the time it takes a node to calculate the partial error (shown in step 2 below) during training effectively depends on the number of network parameters and the number of training examples to be evaluated on the node. The size and configuration of the neural network architecture are fixed at the start of the training session, which means the runtime of the per-node calculation of the partial errors can be minimized simply by reducing the number of training examples per node. (Of course, there is a limit, as using too few examples per node will waste parallel computing resources.) Since the partial error calculation on each node is independent, the runtime consumed when computing the partial errors for a fixed training set size will decrease linearly as the number of nodes increases. In other words, calculating the partial errors for a fixed dataset will happen 10x faster when using ten nodes in a compute cluster or cloud instance, and 10,000x faster when using ten thousand nodes.
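A minimal sketch of the step 2 calculation, assuming a plain MPI data-parallel layout (the model, the synthetic data, and the error function below are placeholders invented for this illustration, not the production mapping): each rank evaluates the current parameters on its own slice of the fixed training set, so doubling the number of ranks halves the per-node work.

```c
/* Hypothetical data-parallel sketch (step 2): each MPI rank owns
 * n_total/size examples of a fixed training set and computes one
 * partial error for the current model parameters. */
#include <mpi.h>
#include <stdio.h>

static double model(double x, const double *params) {
    return params[0] * x + params[1];       /* stand-in for the real network */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_total = 1000000;            /* fixed training-set size       */
    const int n_local = n_total / size;     /* fewer examples per node as the
                                               node count grows              */
    const double params[2] = { 2.0, -1.0 }; /* current model parameters      */

    double partial_error = 0.0;
    for (int i = 0; i < n_local; ++i) {
        int gi = rank * n_local + i;        /* this rank's slice of the data */
        double x = (double)gi / n_total;    /* synthetic input               */
        double target = 2.0 * x - 1.0;      /* synthetic label               */
        double diff = model(x, params) - target;
        partial_error += diff * diff;       /* sum of squared residuals      */
    }

    printf("rank %d evaluated %d examples, partial error %g\n",
           rank, n_local, partial_error);

    MPI_Finalize();
    return 0;
}
```

The point of the sketch is only that the loop length, and hence the step 2 runtime, shrinks in direct proportion to the number of ranks.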

Figure 4: A massively parallel mapping
As the speed and number of the computational nodes increase, the training performance for a given training set will be dictated more and more by the performance characteristics of the communications network.
For example, the global broadcast of the model parameters (shown in step 1) will take constant time regardless of the number of distributed nodes used during training. The basic idea, much like that of radio or broadcast television, is that it takes the same amount of time to broadcast to one listener as it does to broadcast to all the listeners.
Inside the distributed computer (be it a local cluster, cloud instance, or leadership-class supercomputer), it is the overall throughput of the communications fabric that dominates the broadcast time. The raw bits per second transported by the network links are important, as both 100 Gb/s InfiniBand EDR and Intel OPA communications fabrics will communicate data faster than older 40 Gb/s networks. As shown in the figure below, the raw 100 Gb/s number tells only part of the story, as throughput also depends on how well the underlying protocol communicates data and performs error correction. What really matters for HPC application performance, be it global broadcast or point-to-point communication, is the amount of data transported per unit time, which is the measured MPI bandwidth.
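A minimal sketch of the step 1 broadcast, again assuming a plain MPI layout (the parameter count below is illustrative): rank 0 sends the current parameter vector to every node with a single collective call, and the wall-clock time is governed by the fabric's delivered bandwidth rather than the node count.

```c
/* Hypothetical sketch of step 1: broadcast the current model parameters
 * from rank 0 to every computational node before the next training step. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_params = 10 * 1000 * 1000;   /* millions of network parameters */
    float *params = malloc(n_params * sizeof *params);
    if (rank == 0)
        for (int i = 0; i < n_params; ++i)
            params[i] = 0.01f * i;           /* stand-in for the current model */

    double t0 = MPI_Wtime();
    MPI_Bcast(params, n_params, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* step 1 */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("broadcast %.1f MB of parameters in %.3f ms\n",
               n_params * sizeof *params / 1.0e6, (t1 - t0) * 1e3);

    free(params);
    MPI_Finalize();
    return 0;
}
```

Because this broadcast repeats every training step, the delivered (measured) MPI bandwidth shown in Figure 5, rather than the 100 Gb/s line rate, is the number that matters.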

Figure 5: Performance of Intel(R) OPA vs. InfiniBand EDR on the Ohio State microbenchmarks* (Image courtesy Intel)
Intel OPA’s higher MPI bandwidth compared to InfiniBand EDR can help speed the training of deep learning neural networks that contain millions of network parameters, all of which need to be communicated quickly across the fabric to the distributed computational nodes.
Intel OPA has incorporated a number of features that preserve high-performance, robust distributed HPC computing at scale, such as:
- Packet Integrity Protection (PIP): PIP allows corrupted packets to be detected without increasing latency. According to the Intel publication Transforming the Economics of HPC Fabrics with Intel® Omni-Path Architecture, the Intel OPA strategy eliminates the long delays associated with end-to-end error recovery techniques that require error notices and retries to traverse the full length of the fabric. In contrast, some recent InfiniBand implementations support link-level error correction through Forward Error Correction (FEC). However, FEC introduces additional latency into the normal packet processing pipeline. Intel OPA provides similar levels of integrity assurance without the added latency.
- Dynamic Lane Scaling: Each 100 Gb/s Intel OPA link is composed of four 25 Gb/s lanes. In traditional InfiniBand implementations, if one lane fails, the entire link goes down, which will likely cause the HPC application to fail. In contrast, if an Intel OPA lane fails, the rest of the link remains up and continues to provide 75 percent of the original bandwidth.
- On-Load Design: Intel OPA utilizes an on-load model that shifts some of the work to the processor. For example, connection address information is maintained in host memory so all inbound packets “hit” and can be processed with deterministic latency. Adapter cache misses are eliminated and routing pathways can be optimized during runtime to make better use of fabric resources. Meanwhile, CPU utilization has been significantly reduced with PSM (Performance Scaled Messaging) software as will be seen below.
Small message latency is the key to faster distributed machine learning
For network-bound computations, the network reduction can be the rate-limiting step during training.
Reduction operations generally call highly optimized library methods such as MPI_Reduce() when running in an MPI environment. For machine learning, a reduction is used to combine the partial errors calculated on each distributed node into a single overall error, which the optimization method uses to determine a new, more accurate set of neural network parameters. As can be seen in the Figure 4 animation, the new parameters cannot be calculated until the partial errors computed by each computational node have been reduced to a single floating-point value.
As part of the reduction operation, each computational node must communicate its single floating-point partial error value across the communications fabric, which means that latency and small message throughput must scale with the number of computational nodes.
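A minimal sketch of that reduction in plain MPI (the article names MPI_Reduce(); this sketch uses the MPI_Allreduce() variant so every rank receives the total and can apply the parameter update locally): each rank contributes one 8-byte partial error, so fabric latency and small-message rate, not bandwidth, determine how quickly the collective completes at scale.

```c
/* Hypothetical sketch: reduce one 8-byte partial error per rank to a
 * single overall error. The message is tiny, so latency and message
 * rate, not link bandwidth, govern this step's runtime at scale. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double partial_error = 0.5 * (rank + 1);   /* stand-in for this node's value */
    double total_error = 0.0;

    /* Sum the partial errors and return the result to every rank so each
       node can compute the next set of parameters locally. */
    MPI_Allreduce(&partial_error, &total_error, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total error this step: %g\n", total_error);

    MPI_Finalize();
    return 0;
}
```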
Intel OPA was designed to provide extremely high message rates, especially with small message sizes. It also delivers low fabric latency that remains low at scale. Yaworski notes, “Intel OPA’s low latency, high message rate and high bandwidth architecture are key for machine learning performance”.
Intel OPA’s low latency, high message rate and high bandwidth architecture are key for machine learning performance – Joe Yaworski (Intel Director of Fabric Marketing for the HPC Group)
The Intel OPA performance goals have been validated in practice by recent benchmark results:
- Lower MPI Latency: Compared to EDR InfiniBand, Intel OPA demonstrates lower MPI latency* according to the Ohio State Microbenchmarks.
Figure 6: Image courtesy Intel
- Higher MPI message rate: Intel OPA also demonstrates a better (i.e. higher) MPI message rate on the Ohio State Microbenchmarks. The test measurements included one switch hop.

Figure 7: Image courtesy Intel*
- Lower latency at scale: Test results show that Intel OPA demonstrates better latency than EDR at scale.

Figure 8: Latency as a function of number of nodes** (Image courtesy Intel)
- Lower CPU utilization: It’s also important that communication not consume CPU resources that could otherwise be used for training and other HPC applications. As shown in the Intel measurements, Intel OPA is far less CPU intensive.

Figure 9: CPU utilization during the osu_mbw_mr message rate test*** (Image courtesy Intel)
In combination, these Intel OPA performance characteristics help to speed machine learning in a distributed cloud or HPC cluster environment. The scalability and increased performance when handling small messages give Intel OPA an advantage over EDR InfiniBand when calculating reductions.
Summary
This is the second in a multi-part series on machine learning that examines the impact of Intel SSF technology on this valuable HPC field. Intel SSF is designed to help the HPC community identify the right combinations of technology for machine learning and other HPC applications.
For data transport, the Intel OPA specifications hold exciting implications for machine learning applications, as they promise to speed the training of distributed machine learning algorithms through: (a) a 4.6x improvement in small message throughput over the previous generation fabric technology, (b) a 65 ns decrease in switch latency (think how all those latencies add up across all the switches in a big network) [3], and (c) a 100 Gb/s network that speeds the broadcast of millions of deep learning network parameters to all the nodes in the computational cluster (or cloud) and minimizes startup time when loading large training data sets.
1) See Figure 1 in Transforming the Economics of HPC Fabrics with Intel® Omni-Path Architecture.
2) A.S. Lapedes and R.M. Farber, “How Neural Nets Work,” Neural Information Processing Systems (Proceedings of the IEEE 1987 Denver Conference on Neural Networks), D.Z. Anderson, ed., 1988.
* Tests performed on Intel® Xeon® Processor E5-2697v3 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading technology enabled. Ohio State Micro Benchmarks v. 4.4.1. Intel OPA: Open MPI 1.10.0 with PSM2 as packaged with IFS 10.0.0.0.697. Intel Corporation Device 24f0 – Series 100 HFI ASIC. OPA Switch: Series 100 Edge Switch – 48 port. IOU Non-posted Prefetch disabled in BIOS. EDR: Open MPI 1.10-mellanox released with hpcx-v1.5.370-gcc-MLNX_OFED_LINUX-3.2-1.0.1.1-redhat7.2-x86_64. MLNX_OFED_LINUX-3.2-2.0.0.0 (OFED-3.2-2.0.0). MXM_TLS=self,rc, -mca pml yalla tunings. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 – 36 Port EDR Infiniband switch. Intel® True Scale: Open MPI. QLG-QLE-7342(A), 288 port True Scale switch. 1. osu_latency 8 B message. 2. osu_bw 1 MB message. 3. osu_mbw_mr, 8 B message (uni-directional), 28 MPI rank pairs. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
**Intel® Xeon® processor E5-2697 v3, Intel® Turbo Boost technology enabled. 2133 MHz 64 GB DDR4 RAM per node. 28 MPI ranks per node. HPCC 1.4.3. Open MPI 1.10 for Intel OPA. Mellanox EDR based on internal measurements. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 – 36 Port EDR Infiniband switch. Open MPI 1.8-mellanox released with hpcx-v1.3.336-icc-MLNX_OFED_LINUX-3.0-1.0.1-redhat6.6-x86_64.tbz. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
*** Intel® Xeon® Processor E5-2690v4 dual-socket servers. OSU OMB 4.1.1 osu_mbw_mr with 1-28 ranks per node. Number of iterations at 1MB message size increased enough to allow quasi-steady CPU utilization to be captured using Linux top. CPU utilization is shown as the average on both the send and receive nodes across all participating cores. Example: 1 core 100% busy on a 10 core CPU would be “10% CPU utilization”. Benchmark processes pinned to the cores on the socket that is local to the adapter before using the remote socket. Intel MPI 5.1.2. RHEL 7.1, 2400MHz DDR4 ram (128GB) per node. Intel® OPA: shm:tmi fabric, 100 series HFI and 48 port Edge switch. Mellanox EDR: shm:dapl fabric. Mellanox ConnectX-4 MT27700 HCA and MSB7700-ES2F switch. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.