Sponsored Post

This is the first in a multi-part series on machine-learning that examines the impact of Intel Scalable System Framework (Intel® SSF) technology on the valuable HPC field of deep-learning. In this article we will focus on Intel® Xeon® and Intel® Xeon Phi™ processors. Follow-on articles will discuss other Intel SSF components including networking and storage with a focus on solving big HPC problems such as deep learning.

Interest in deep learning is accelerating as enterprising organizations are realizing just how big an impact it can have across a wide range of markets ranging from Internet search, to social media, to real-time robotics, self-driving vehicles, drones and more. It has helped foster technology innovation spanning the gamut of machine performance, deep learning (and machine learning in general) that encompasses floating-point-, network- and data-intensive ‘training’ plus real-time, low-power ‘prediction’ operations. With the Intel Scalable System Framework, the company is introducing a holistic direction for developing high performance, balanced, power-efficient and reliable systems capable of supporting a wide range of workloads, including deep learning and machine learning in general. Intel SSF represents both the hardware and software plus the outline of how to incorporate yet-to-be-released reference architectures and reference designs.

Figure 1: Intel(R) Scalable System Framework (Image courtesy Intel, click view website)

Training ‘complex multi-layer’ neural networks is referred to as deep-learning as these multi-layer neural architectures interpose many neural processing layers between the input data and the predicted output results – hence the use of the word deep in the deep-learning catchphrase. While the training procedure is computationally expensive, evaluating the resulting trained neural network is not, which explains why trained networks can be extremely valuable as they have the ability to very quickly perform complex, real-world pattern recognition tasks on a variety of low-power devices including – security cameras, mobile phones, wearable technology. These architectures can also be implemented on FPGAs to process information quickly and economically in the data center on low-power devices, or as an alternative architecture on high-power FPGA devices.

The Intel Xeon Phi processor product family is but one part of Intel SSF that will bring machine-learning and HPC computing into the exascale era. Intel’s vision is to help create systems that converge HPC, Big Data, machine learning, and visualization workloads within a common framework that can run in the data center – from smaller workgroup clusters to the world’s largest supercomputers – or in the cloud. Intel SSF incorporates a host of innovative new technologies including Intel® Omni-Path Architecture (Intel® OPA), Intel® Optane™ SSDs built on 3D XPoint™ technology, and new Intel® Silicon Photonics – plus it incorporates Intel’s existing and upcoming compute and storage products, including Intel Xeon processors, Intel Xeon Phi processors, and Intel® Enterprise Edition for Lustre* software.

Computation

Fueled by modern parallel computing technology, it is now possible to train very complex multi-layer neural network architectures on large data sets to an acceptable level of accuracy. These trained networks can perform complex pattern recognition tasks for real-world applications ranging from Internet image search to real-time runtimes in self-driving cars. The high-value, high accuracy recognition (sometimes better than human) trained models have the ability to be deployed nearly everywhere, which explains the recent resurgence in machine-learning, in particular in deep-learning neural networks.

Intel® Xeon Phi™ Processors

The new Intel processors, in particular the upcoming Intel Xeon Phi processor family (formerly code name Knights Landing), promises to set new levels of training performance with multiple per core vector processing units, long vector instructions, high-bandwidth MCDRAM to reduce memory bottlenecks, and Intel Omni-Path Architecture to even more tightly couple large numbers of distributed computational nodes.

Computational nodes based on the next generation Intel Xeon Phi devices promise to accelerate both memory bandwidth and computationally limited machine learning algorithms, because these processors greatly increase floating-point performance while the on-package MCDRAM greatly increases memory bandwidth. In particular, the upcoming Intel Xeon Phi processor contains up to seventy two (72) processing cores where each core contains two AVX-512 vector processing units. The increased floating-point capability will benefit computationally intensive deep-learning neural network workloads.

Intel® Xeon® Processors

The recently announced Intel Xeon processors E5 v4 product family, based upon the “Broadwell” microarchitecture, is another component in the Intel SSF that can help meet the computational requirements of organizations utilizing deep learning. This processor family can deliver up to 47%¹ more performance across a wide range of HPC codes. In particular, improvements to the microarchitecture improve the per-core floating-point performance – particularly for the floating-point multiply and FMA (Fused Multiply-Add) that dominate machine learning runtimes – as well as improve parallel, multi-core efficiency. For deployments, a 47% performance increase means far fewer machines or cloud instances need to be used to handle a large predictive workload such as classifying images or converting speech to text – which ultimately saves both time and money.

Figure 2: Performance on a variety of HPC applications* (Image courtesy Intel)

See the March 31, 2016 article, “How the New Intel Xeon Processors Benefit HPC Applications”, for more information about the floating-point and other thermal, memory interface and virtual memory improvements in these new Intel Xeon processors.

An exascale capable parallel mapping for machine learning

The following figure schematically shows an capable massively parallel mapping for training neural networks. This mapping was initially devised by Farber in the 1980s for the SIMD-based (Single Instruction Multiple Data) CM-1 connection machine. Since then, that same mapping has run on most supercomputer architectures and recently been used to achieve petaflop per second average sustained performance² on both Intel Xeon Phi processor and GPU based leadership class supercomputers. It has been used to run at scale on supercomputers containing tens to hundreds of thousands of computational devices and has been shown to deliver TF/s (teraflop per second) performance per current generation Intel Xeon Phi coprocessors and other devices.

Figure 2: Massively Parallel Mapping (Courtesy TechEnablement)

With this mapping, any numerical nonlinear optimization method that evaluates an objective function that returns an error value (as opposed to an error vector) can be used to adjust the weights of the neural network during the training process. The objective function can be parallelized across all the examples in the training set, which means it maps very efficiently to both vector and SIMD architectures. It also means that the runtime is dominated by the objective function as training to solve complex tasks requires large amounts of data. Each calculation of the partial errors runs independently, which means that this mapping scales extremely well across arbitrarily large numbers of computational devices (e.g. tens to hundreds of thousands of devices). The reduction operation used to calculate the overall error makes this a tightly coupled computation, which means that performance will be limited to that of the slowest computational node (shown in step 2) or by the communications network for the reduction (shown in step 3). It also means that the overall scaling is exascale capable due to the O(log(NumberOfNodes)) behavior of the reduction operation.

For more information see:

Chapter 7, “Deep Learning and Numerical Optimization” in High Performance Parallelism Pearls.
Dobbs, “Exceeding Supercomputer Performance with Intel Phi”.

Computational vs. memory bandwidth bottlenecks

The hardware characteristics of a computational node define which bottleneck will limit performance – specifically if the node will be computationally or memory bandwidth limited. A balanced hardware architecture like that of the Intel Xeon Phi processors will support many machine-learning algorithms and configurations as well as other HPC applications.

Well implemented machine-learning packages structure the training data in memory so it can be read sequentially via streaming read operations. Current processors are so fast that they can process simple objective functions faster than the memory system can stream data. As a result, the processor stalls while waiting for data. Complex objective functions, especially those used for deep-learning, have a high computational latency (meaning they require more floating-point operations), which relieves much of the pressure on the memory subsystem. In addition, asynchronous prefetch operations can be used to hide the cost of accessing memory, so complex deep-learning objective functions tend to be limited by the floating-point capabilities of the hardware.

Computational nodes based on the new Intel Xeon Phi processors promise to accelerate both memory bandwidth and computationally limited machine learning objective functions, because computational nodes based on these processors greatly increase floating-point performance while on package MCDRAM greatly increases memory bandwidth³.

In particular, each Intel Xeon Phi processor contains up to seventy two (72) processing cores where each core contains two Intel AVX-512 vector processing units. The increased floating-point capability will benefit computationally intensive deep-learning neural network architectures. The large machine-learning data sets required for training on complex tasks will ensure that the longer Intel AVX-512 vector instructions will be of benefit.

The new Intel Xeon Phi processor product family will contain up to 16GB (gigabytes) of fast MCDRAM, a stacked memory which, along with hardware prefetch and out-of-order execution capabilities, will speed memory bandwidth limited machine learning and ensure that the vector units remain fully utilized. Further, this on-package memory can by backed by up to six channels of DDR4 memory, with a maximum capacity of . Providing a platform with the capacity of DDR memory and the speed of on package memory.

Convergence and numerical noise

The training of machine-learning algorithms is something of an art – especially when training the most commercially useful nonlinear machine-learning objective functions. Effectively, the processor has to iteratively work towards the solution to a non-linear optimization problem. Floating-point arithmetic is only an approximation, which means that each floating-point arithmetic operation introduces some inaccuracies in the calculation, which is also referred to as numerical noise. This numerical noise, in combination with limited floating-point precision can falsely indicate to the numerical optimization method that there is no path forward to a better, more accurate solution. In other words, the training process can get ‘trapped’ in a local minima and the resulting trained machine-learning algorithm may perform its task poorly and provide badly classified Internet search results or dangerous behavior in a self-driving car.

For complex deep-learning tasks, the training process is so computationally intensive that many projects never stop the training process, but rather use snapshots of the trained model parameters while the training proceeds 24/7 non-stop. Validation efforts are used to check to see if the latest snapshot of the trained network performs better on the desired task. If need be, the training process can then be restarted with additional data to help correct for errors. Google uses this process when training the deep-learning neural networks used in their self-driving cars⁴.

Load balancing across distributed clusters

Large computational clusters can be used to speed training and dramatically increase the usable training set size. Both the ORNL Titan and TACC Stampede leadership class supercomputers are able to sustain petaflop per second performance on machine-learning algorithms⁵. When running in a distributed environment, it is possible to load-balance the data across the computational nodes to achieve the greatest number of training iterations per unit time. The basic idea is to choose a minimum size of the per-node data set (shown in step 2) to minimize the per node runtime while fully utilizing the parallel capabilities of the node so no computational resources are wasted. In this way, the training process can be tuned to a computational cluster to deliver the greatest number of training iterations per unit time without a loss of precision or wasting any computational resources.

A challenge with utilizing multiple TF/s computational nodes like the Intel Xeon Phi processor family is that the training process can become bottlenecked by the reduction operation because the nodes are so fast. The reduction operation is not particularly data intensive, with each result being a single floating-point number. This means that network latency, rather than network bandwidth, limit the performance of the reduction operation . Intel OPA specifications are exciting for machine-learning applications as they promise to speed distributed reductions through: (a) in small message throughput over the previous generation fabric technology⁶, (b) a in switch latency (think how all those latencies add up across all the switches in a big network), and (c) some of the new Intel Xeon Phi family of processors will incorporate an on-chip Intel® OPA interface which may help to reduce latency even further.

From a bandwidth perspective, Intel OPA also promises a 100 Gb/s bandwidth, which will greatly help speed the broadcast of millions of deep-learning network parameters to all the nodes in the computational cluster plus minimize startup time when loading large training data sets. In particular, we look forward to seeing the performance of Lustre* over Intel OPA for large computational clusters, which means very large training runs can be started/restarted with little overhead.

Peak performance and power consumption

Machine-learning algorithms are heavily dependent on the fused multiply-add (FMA) performance to perform dot products. In fact, linear neural networks such as those that perform a PCA (Principle Components Analysis) . One example can be found in the PCA teaching code in the farbopt github repository. Thus any optimizations that can increase the performance of the FMA instruction will benefit machine-learning algorithms be they on a CPU, GPU, or coprocessor.

These FMA dominated linear networks are of particular interest as they can show how close actual code can come to achieving peak theoretical performance on a device, as the FMA instruction is counted as two floating-point operations per instruction. These FMA dominated codes can also highlight bottlenecks within the processor. For example, GPU performance tends to fall off rapidly once the internal register file is exhausted, because register memory is the only memory fast enough to support peak performance. As a result, register file limitations can be a performance issue when training large, complex deep-learning neural networks.

Of course heat generation and power consumption are also critical metrics to consider – especially when evaluating a hardware platform that will be training machine-learning algorithms in a 24/7 environment. At the moment, both the current generation Intel Xeon Phi coprocessors and GPUs deliver multi-teraflop per second performance in a PCIe bus form factor. It will be interesting to measure the flop per watt ratio of the new Intel Xeon Phi processors as they run near peak performance when training both linear and nonlinear neural networks.

Upcoming articles

This is the first in a multi-part series on machine-learning that examines the impact of Intel SSF technology on this valuable HPC field. Intel SSF is designed to help the HPC community design identify the right combinations of technology for machine learning and other HPC applications. Upcoming articles will discuss more about Intel SSF components and their application to machine learning. Teasers include:

Network: For data transport, the Intel OPA specifications hold exciting implications for machine-learning applications as it promises to speed the training of distributed machine learning algorithms through and (c) by providing a 100 Gb/s network to speed the broadcast of millions of deep-learning network parameters to all the nodes in the computational cluster (or cloud) plus minimize startup time when loading large training data sets.
Storage: Of course, Intel SSF . The Intel Enterprise Edition for Lustre* software is intended to help data scientists (and HPC scientists in general) manage their data in a global storage namespace plus utilize both storage and network hardware resources for high-performance and high-reliability so those same data scientists can always access their data quickly. Other products such as Intel® Parallel Studio XE 2016 helps programmers boost application performance by taking advantage of the ever-increasing processor core count and vector register width available in Intel Xeon processors, Intel Xeon Phi devices.

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Results based on Intel® internal measurements as of February 29, 2016.

² http://www.techenablement.com/deep-learning-teaching-code-achieves-13-pfs-on-the-ornl-titan-supercomputer/

³ Vector Peak Perf: 3+TF DP and 6+TF SP Flops Scalar Perf: ~3x over Knights Corner Streams Triad (GB/s): MCDRAM : 400+; DDR: 90+; See slide 4 http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf

⁴ http://www.scientificcomputing.com/articles/2016/02/deep-learning-revitalizes-neural-networks-match-or-beat-humans-complex-tasks

⁵ http://insidehpc.com/2013/06/tutorial-on-scaling-to-petaflops-with-intel-xeon-phi/

⁶https://ramcloud.atlassian.net/wiki/download/attachments/22478857/20150902-IntelOmniPath-WhitePaper_2015-08-26-Intel-OPA-FINAL.pdf