Recent benchmarks from respected sources such as Kyoto University and Colfax Research show that the Intel Scalable System Framework (Intel SSF) balanced-technologies approach, coupled with optimized software including Python, Theano, and Torch, can provide a tremendous performance benefit for deep learning and, more generally, for both machine learning and generic HPC applications. The performance increases are the result of code modernization efforts that focused on: (a) increasing parallelism, (b) efficiently utilizing CPU vector units, and (c) efficiently utilizing Intel SSF technologies such as high-bandwidth MCDRAM (available on Intel Xeon Phi processors) and communications libraries like the Intel MPI Library for high-bandwidth, low-latency distributed computations.
The important take-away message from this article is that code modernization can deliver significant performance benefits to deep learning and machine learning algorithms, for both training and inference, without having to change the high-level source code. Instead, the optimizations are performed in the generated code or the software middleware.
Code modernization can deliver significant performance benefits to deep learning and machine learning algorithms, for both training and inference, without having to change the high-level source code.
In particular, this article will focus on the implications of code modernization for the open source Lua/Torch-based NeuralTalk2 image tagging framework as well as the Python-based Theano framework. These provide example speedups for both training and inference (alternatively called prediction or scoring):
- Training: The Kyoto University/Intel collaboration focused on optimizing Theano's code generation for CPU-based target architectures, which achieved an 8.78x speedup and enabled a dual-socket Intel Xeon processor system to outperform an NVIDIA K40 GPU.
- Inference: The Colfax Research code modernization focused on optimizing the Torch middleware while leaving the high-level Lua scripts essentially untouched (only one source line was changed), yet achieved a 28x speedup in inference performance when running on a dual-socket Intel Xeon E5-2750 v4 CPU and a 55x speedup when utilizing a new Intel Xeon Phi processor. (Colfax Research notes that Intel Xeon Phi can deliver even greater speedups, but that would have required expressing more parallelism in the high-level source code.)
The importance of the Colfax Research inference results can be appreciated through Diane Bryant’s (Executive VP and General Manager of the Data Center Group, Intel) observation that, “The Intel Xeon processor E5 family is the most widely deployed processor for machine learning inference, with the added flexibility to run a wide variety of data center workloads”. Such a dramatic speedup (e.g. 28x) on the most widely deployed hardware platform should make code modernization a “must examine” action item for anyone who uses inference in commercial, scientific, or data analytics applications.
The Intel Xeon processor E5 family is the most widely deployed processor for machine learning inference, with the added flexibility to run a wide variety of data center workloads – Diane Bryant
The Colfax Research case study
Colfax Research created a proof-of-concept implementation of a highly optimized machine learning application for Intel Architecture, using NeuralTalk2 as the example.
NeuralTalk2 uses machine learning to analyze real-life photographs of complex scenes and produce a verbal description of the objects in the scene and the relationships between them (e.g., “a cat is sitting on a couch”, “woman is holding a cell phone in her hand”, “a horse-drawn carriage is moving through a field”, etc.). Written in Lua, it uses the Torch machine learning framework.
The optimizations Colfax Research performed included:
- Rebuilding the Torch library using the Intel® compiler.
- Performing code modernizations such as batching GEMM (matrix-matrix multiplication) operations and incorporating some algorithmic changes.
- Improving parallelization and pinning threads to cores.
- Taking advantage of Intel Xeon Phi processor high-speed memory.
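The batched-GEMM idea in the list above can be sketched in a few lines. The following is an illustrative NumPy example of the general technique, not Colfax's actual Torch code: grouping many small matrix products into one batched call gives an optimized BLAS backend (such as Intel MKL) far more work to parallelize and vectorize per call than a stream of tiny individual GEMMs.

```python
import numpy as np

# Illustrative only: batch 64 small (128x128) matrix products.
batch, n = 64, 128
rng = np.random.default_rng(42)
A = rng.random((batch, n, n))
B = rng.random((batch, n, n))

# Naive approach: one small GEMM per sample, driven by a Python loop.
C_loop = np.empty_like(A)
for i in range(batch):
    C_loop[i] = A[i] @ B[i]

# Batched approach: a single call over the whole 3-D stack, which an
# optimized BLAS can parallelize across cores and vector units.
C_batched = np.matmul(A, B)

assert np.allclose(C_loop, C_batched)
```

On the thread-pinning point, a typical setting with Intel's OpenMP runtime (hypothetical for this particular study) is `KMP_AFFINITY=compact` together with `OMP_NUM_THREADS` set to the number of physical cores.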
In the case study, presented at ISC’16, Colfax observed performance improvements of 55x on Intel Xeon Phi processors and 28x on Intel Xeon processors v4 (Broadwell), as shown in Figure 1.
The Kyoto experience
The Kyoto University Graduate School of Medicine is applying various machine learning and deep learning algorithms to problems in the life sciences, including drug discovery, medicine, and health care. They recognized that the performance of the open source Theano C++ multi-core code could be significantly improved. Theano is a Python library that lets researchers transparently run deep learning models on CPUs and GPUs; it does so by generating C++ code from the Python script for both CPU and GPU architectures. Intel has optimized this popular deep learning framework, which Kyoto uses to perform computational drug discovery and machine-learning-based big data analytics. For historical reasons, many software packages like Theano lacked optimized multicore code because most of the open source effort had gone into optimizing the GPU code paths.
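Conceptually, the CPU optimization opportunity Kyoto identified comes from turning interpreted, per-element work into whole-array operations that a compiler can map onto SIMD vector units. The following is a minimal NumPy sketch of that idea, not Theano's actual generated C++:

```python
import numpy as np

def sigmoid_loop(x):
    # Interpreted per-element loop: the form naive Python code takes.
    out = np.empty_like(x)
    for i in range(x.size):
        out.flat[i] = 1.0 / (1.0 + np.exp(-x.flat[i]))
    return out

def sigmoid_vectorized(x):
    # One whole-array expression: the form an optimizing backend
    # (like Theano's generated C++) can vectorize and parallelize.
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 10_000)
assert np.allclose(sigmoid_loop(x), sigmoid_vectorized(x))
```

Both functions compute the same values; the vectorized form simply expresses the computation in a way that optimized code generation can exploit.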
Two Kyoto benchmarks are shown below, using the hardware that was available at the time. These benchmarks demonstrate that a dual-socket Intel Xeon processor E5 v3 (formerly codenamed Haswell) can outperform an NVIDIA K40 GPU on a large Deep Belief Network (DBN) benchmark implemented via the popular Theano machine-learning package. The key message in the first figure is the speedup achieved by code optimization (the two middle bars) for two problem sizes over unoptimized code (the leftmost pair of bars). The rightmost pair of bars shows that an Intel Xeon processor outperforms an NVIDIA K40 GPU when running the optimized version. It is expected that the newest Intel Xeon Phi processors will deliver significantly faster performance than the Intel Xeon E5-2699 v3 processor. The second figure shows both performance and capacity improvements.
We look forward to seeing updated results from the Kyoto team as it is highly likely that results on the newest Intel Xeon Phi processors will be even faster.
Big data stresses both network and storage
The combination of Intel SSF technologies is key to fast time-to-model performance because big data is key to accurately training neural networks to solve complex problems. The paper “How Neural Networks Work” demonstrated that a neural network is actually fitting a ‘bumpy’ multi-dimensional surface, which means that lots of training data is required to specify the hills and valleys, or points of inflection, on the multidimensional surface being fitted. Not surprisingly, big, complex data sets can make preprocessing the training data as complex a computational problem as the training itself, especially when extracting information from unstructured data.
Big data means that parallel distributed computing (illustrated by the mapping below) is a necessary challenge for machine learning, as even the TF/s parallelism of a single Intel Xeon or Intel Xeon Phi processor-based workstation is simply not sufficient to train accurately, in a reasonable time, on many complex data sets. Instead, numerous computational nodes must be connected via high-performance, low-latency communications fabrics like Intel Omni-Path Architecture (Intel OPA), used together with communications libraries such as the Intel MPI Library.
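The data-parallel pattern that MPI enables can be illustrated with a single-process simulation: each "rank" computes a gradient on its own shard of the training data, then an allreduce-style sum averages the gradients so that every rank applies the same model update. This is a conceptual sketch (the gradients are random stand-ins for real backpropagation), not production training code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ranks, n_params = 4, 8

# Each "rank" computes a local gradient from its shard of the data.
# Random vectors stand in for real backpropagation results here.
local_grads = [rng.standard_normal(n_params) for _ in range(n_ranks)]

# The allreduce step: sum across ranks, then average. In a real MPI
# program, every rank ends up holding this same averaged gradient
# and applies it to its local copy of the model parameters.
avg_grad = np.sum(local_grads, axis=0) / n_ranks
```

With mpi4py, for example, the averaging step maps onto `comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)` followed by division by the rank count; the performance of that collective is exactly where the fabric and MPI library quality show up.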
The Intel MPI library
MPI is a key communications layer for many scientific and commercial applications, including machine learning and deep learning applications. In general, all distributed communications pass through the MPI API (Application Programming Interface), which means compliance and performance at scale are both critical.
The Intel MPI Library provides programmers a “drop-in” MPICH replacement that can deliver the performance benefits of the Intel OPA communications fabric plus high-core-count Intel Xeon and Intel Xeon Phi processors. Tests have verified the scalability of the Intel MPI implementation to 340,000 MPI ranks, where a rank is a separate MPI process that can run on a single core or an individual system. Other communications fabrics such as InfiniBand are also supported, and programmers can recompile their applications to use the Intel MPI Library.
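As a hypothetical illustration of the drop-in workflow (the paths and binary name below are placeholders, not taken from the case studies):

```shell
# Placeholder paths and binary name, for illustration only.
# Put the Intel MPI runtime (mpirun, libraries) on the environment...
source /opt/intel/impi/latest/env/vars.sh

# ...then launch the existing MPICH-compatible binary unchanged.
mpirun -n 64 ./my_mpi_app
```

No source changes are needed for MPICH-compatible binaries; applications built against other MPI implementations would be recompiled against the Intel MPI headers and libraries instead.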
As shown in Figure 4, the global broadcast of parameters to the computation nodes is a performance-critical operation. The following graph shows how the Intel MPI team has achieved an 18.24x improvement over OpenMPI.
Machine learning is but one example of a tightly coupled distributed computation where the small-message traffic generated by a distributed network reduction operation can have a big impact on application performance. The 1.34x performance improvement shown below translates to a significant time-to-model improvement simply by “dropping in” the Intel MPI Library for MPICH-compatible binaries (or by simply recompiling to transition from non-MPICH libraries like OpenMPI).
Such reduction operations are common in HPC codes, which is one reason why organizations spend large amounts of money on the communications fabric, which can account for up to 30% of the cost of a new machine. Increased scalability and performance at a lower price point explain the importance of Intel OPA to the HPC and machine learning communities, as well as the cloud computing community.
Intel Omni-Path Architecture
Intel MPI can work with a variety of communications fabrics. For data transport, the Intel OPA specifications hold exciting implications for machine-learning applications, as the fabric promises to speed the training of distributed machine learning algorithms through: (a) a 4.6x improvement in small-message throughput over the previous-generation fabric technology, (b) a 65 ns decrease in switch latency (consider how those latencies add up across all the switches in a big network), and (c) 100 Gb/s network bandwidth to speed the broadcast of millions of deep-learning network parameters to all the nodes in the computational cluster (or cloud) and to minimize startup time when loading large training data sets.
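To see why a 65 ns per-switch improvement matters, a quick back-of-envelope calculation helps (the five-hop path is a hypothetical example, not an Intel OPA figure):

```python
# Per-switch latency savings accumulate over every switch a message crosses.
savings_per_switch_ns = 65   # the Intel OPA figure quoted above
hops = 5                     # hypothetical path length in a large fabric
total_savings_ns = savings_per_switch_ns * hops
print(total_savings_ns)      # 325 ns saved on every traversal of the path
```

For the millions of small messages generated by parameter broadcasts and reductions during distributed training, these per-hop savings are paid (or saved) on every single message.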
The Lustre filesystem for storage
Succinctly, machine learning and other data-intensive HPC workloads cannot scale unless the storage filesystem can scale to meet the increased demand for data. This includes the heavy demands imposed by data preprocessing for machine learning (as well as other HPC problems) and the fast loading of large data sets during restart operations. These requirements make Lustre – the de facto high-performance filesystem – a core component in any machine learning framework. The Intel® Enterprise Edition for Lustre software, backed by expert support, brings the power and scalability of Lustre to the enterprise platform.
Modern software is a requirement
To assist others in performing fair benchmarks and in realizing the benefits of multi- and many-core performance, Intel recently announced several optimized libraries for deep and machine learning, such as the high-level Intel Data Analytics Acceleration Library (Intel DAAL) and the lower-level Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN), which provide optimized deep learning primitives. The Intel MKL-DNN announcement also noted that the library is open source, has no restrictions, and is royalty free. Even the well-established Intel Math Kernel Library (Intel MKL) is getting a machine learning refresh with the addition of optimized primitives to speed machine and deep learning on Intel architectures. More about these libraries can be seen in the Faster Machine Learning and Data Analytics Using Intel Performance Libraries video.
Summary: Common themes for code modernization
The techniques that provided the bulk of the performance improvements (e.g. increasing parallelism, efficiently utilizing vectorization, making use of faster MCDRAM performance, and distributing computations with MPI) have also delivered significant performance improvements on both Intel Xeon and Intel Xeon Phi processors for a number of other code modernization projects. Many additional examples beyond the two mentioned here can be found in the High Performance Parallelism Pearls series (vol. 1 and vol. 2) as well as the just-published Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition (2nd Edition). Each of these books provides detailed code analysis, benchmarks, and working code examples spanning a wide variety of application areas to help developers achieve success in their own code modernization projects.
Developers should also check out the extensive technical material and training around code modernization at the Intel Developer Zone: https://software.intel.com/modern-code. In addition, see how to apply machine learning algorithms to achieve faster training of deep neural networks at https://software.intel.com/machine-learning.
For more information about the Intel optimized Python, see the TechEnablement article, Up To Orders of Magnitude More Performance with Intel’s Distribution of Python.
Intel also has a number of webinars scheduled covering many of the topics covered in this article. Readers can see the schedule and register at https://software.intel.com/en-us/events/development-tools-webinars.