“For deep learning to have a meaningful impact and business value, the time to train a model must be reduced from weeks to hours,” observed Ananth Sankaranarayanan, Intel’s director of engineering, analytics and AI solutions.
Demonstrating the performance benefits of Intel Xeon and Intel Xeon Phi hardware and new Intel Architecture (IA) optimized software at the recent Intel HPC Developer Conference, Sankaranarayanan presented image recognition training results for an example from Berkeley Vision and Learning Center (BVLC) Caffe, a popular machine learning package. A single Intel Xeon Phi processor 7250 ran the example more than 24x faster when calling the new Intel Math Kernel Library Deep Neural Network (Intel MKL-DNN) primitives. He also presented performance data showing a 40x reduction in training time when running on a 128-node Xeon Phi processor cluster connected by the Intel Omni-Path networking fabric.
Similarly, Sankaranarayanan showed that a 2P Intel Xeon processor E5-2699 v4 achieved a greater than 17x speedup when running an image scoring, or inference, workload. Inference performance is critical to the volume processing of deep learning models in the data center. Sankaranarayanan pointed out, “These performance results are generating very strong interest in Intel Xeon and Intel Xeon Phi processors for machine learning and deep learning using Intel-Caffe and displacing NVIDIA as the only performant solution.” He concluded his presentation by showing that near-native performance can be achieved in Python by calling Intel MKL and using the Intel optimized drop-in replacement Python distribution. These results strongly benefit the numerical Python and Python machine learning communities, as they speed up popular packages such as NumPy, SciPy, Scikit-Learn, PyTables, Scikit-Image, and more.
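To make the "near-native performance in Python" point concrete, the sketch below compares a naive pure-Python matrix multiply with the same operation handed to NumPy, which forwards it to whatever BLAS library the installed NumPy was built against (Intel MKL in the Intel distribution, according to the talk). The function names `python_matmul` and `compare` are illustrative only, and the timings depend entirely on the machine and the NumPy build, so no particular speedup is implied.

```python
import time
import numpy as np

def python_matmul(a, b):
    """Naive triple-loop matrix multiply on nested lists (interpreted Python)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

def compare(n=64, seed=0):
    """Time both approaches on the same random n x n inputs.

    Returns (results_match, seconds_for_loop, seconds_for_numpy).
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))

    t0 = time.perf_counter()
    slow = python_matmul(a.tolist(), b.tolist())
    t1 = time.perf_counter()
    fast = a @ b  # dispatched to the BLAS library NumPy was built with
    t2 = time.perf_counter()

    return np.allclose(slow, fast), t1 - t0, t2 - t1

if __name__ == "__main__":
    same, t_loop, t_blas = compare()
    print(f"results match: {same}")
    print(f"pure Python loop: {t_loop:.4f} s, NumPy matmul: {t_blas:.6f} s")
```

Running the same script under a stock NumPy and under an MKL-backed NumPy is one simple way to see how much of the gain comes from the underlying math library rather than from the Python code itself.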
The Intel deep learning roadmap
The Intel strategy is to make machine learning more pervasive by enabling deployment-ready solutions through a tiered approach supporting a large, open ecosystem as shown below.
There are five main focus tiers:
- Best in class hardware: The foundation of the roadmap is Intel’s processor technology that supports deep learning training through the use of Intel Xeon Phi and Intel Xeon Processors in HPC and commercial data centers, the cloud, and workstations. Figure 1 shows that FPGAs and the Intel Core processor family currently comprise a middle ground, but we note that Intel did demonstrate deep learning running on an FPGA in their booth at Supercomputing 2016. Thus, developing technologies such as the Nervana ASIC and FPGA-based deep learning may move specialized hardware further into the datacenter in the near future. Meanwhile, other Intel technologies – encapsulated in the Intel Scalable System Framework (Intel SSF) – are augmenting Intel processors and software with network technologies (such as Intel Omni-Path architecture), memory technology (with fast memory like MCDRAM), and storage technologies like Intel Solutions for Lustre software and 3D XPoint (a non-volatile memory that is blurring the lines between memory and storage).
- Free Libraries and Languages: Sankaranarayanan focused on the Intel Math Kernel Library (Intel MKL) and the Intel Data Analytics Acceleration Library (Intel DAAL) in his Intel HPC DevCon talk. Intel MKL is well known as a C/C++ and Fortran callable library which Intel has recently augmented with a number of Deep Neural Network building blocks. The optimized Intel Python distribution makes use of both of these libraries as well as the Intel Threading Building Blocks (Intel TBB) library to run Python applications faster and with more parallelism.
- Optimized Open Frameworks: Most machine learning efforts use open source machine learning toolkits. These frameworks provide cost savings as they let data scientists and employees focus on their work rather than programming. Intel is working with the open source developers and pushing improvements upstream for better multi-node scaling and to increase performance for everyone in the Python community. Previously published results by customers such as Kyoto University show substantial performance improvements, to the point that even Intel Xeon processors are able to outperform K40 (Kepler generation) GPUs (Source: Kyoto University). Sankaranarayanan’s slides also show that an Intel Xeon processor can deliver performance competitive with GPUs.
- Tools and Platforms: Collaboration and security throughout the solution stack are important for any machine learning effort. In his talk, Sankaranarayanan discussed the Trusted Analytics Platform (TAP), which provides an integrated environment containing tools, components, and services to optimize performance and security.
- Solution blueprints: These blueprints provide reference solutions across a wide variety of industries, from self-driving cars to the cloud and financial industries. The idea is to save time and accelerate customer efforts by leveraging the work of experts. In other words, don’t get caught “reinventing the wheel”.
Intel MKL Benchmarks
The Intel MKL-DNN primitives that were recently added to the Intel MKL library have been released with liberal licensing so Intel MKL and Intel MKL-DNN can be used everywhere. The deep learning primitives speed up popular open source deep learning frameworks as shown below. Optimized deep learning primitives include convolution, pooling, normalization, ReLU, and inner product.
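For readers unfamiliar with these building blocks, the following is a minimal NumPy reference sketch of four of the named operations (convolution, pooling, ReLU, and inner product). It is not the Intel MKL-DNN API and makes no attempt at optimization. The function names and the simplified single-channel shapes are assumptions chosen for illustration, and normalization is omitted for brevity.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct 2D cross-correlation (the 'convolution' of deep learning),
    single channel, stride 1, no padding."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a (height, width) array
    whose dimensions are both even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def inner_product(x, weights, bias):
    """Fully connected layer: x (batch, in) @ weights (in, out) + bias (out,)."""
    return x @ weights + bias

if __name__ == "__main__":
    image = np.arange(16.0).reshape(4, 4) - 8.0
    print(max_pool_2x2(relu(image)))                     # pooled activations
    print(conv2d_valid(image, np.ones((2, 2))).shape)    # (3, 3)
```

Loops like the one in `conv2d_valid` are exactly the kind of hot spot that an optimized primitive library replaces with vectorized, cache-aware code.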
Sankaranarayanan showed the following benchmark results to illustrate the Intel MKL and Intel MKL-DNN benefits for training on both Intel Xeon and Intel Xeon Phi platforms – up to 24x.
Figure 4: Improved Deep Neural Network training performance using Intel MKL (Source: Intel*)
These primitives also benefit inference performance on both processor families – up to 31x.
Scaling is also a key performance characteristic as most deep learning training workloads are too big for a single computational node to deliver sufficiently fast time-to-model solutions.
Of course, the proof of the pudding is in the eating, which is why Sankaranarayanan discussed a LeCloud deep learning case study. LeCloud is used by a leading video cloud provider in China that offers a video detection service. Switching from the open source BVLC Caffe with OpenBLAS to the Intel optimized Caffe plus Intel MKL delivered a 30x performance improvement in production training jobs.
The Intel Data Analytics Acceleration Library (Intel DAAL) offers fast, ready-to-use, higher-level algorithms to speed up data analysis and machine learning. These libraries can be called from any big-data framework and use communications schemes like Hadoop and MPI. A key feature of the Intel DAAL library is that it helps the user manage their data so that it has the optimal layout in processor memory, which lets the processor deliver significantly higher performance. The Intel DAAL data management is part of the full end-to-end workflow support provided for machine learning workloads.
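The effect of memory layout can be illustrated with plain NumPy. This is not the Intel DAAL API, and the helper name `feature_major` is hypothetical. The sketch assumes a table stored sample by sample (row-major) and an algorithm that reads one feature at a time. Reading a column of the original table strides through memory, while a feature-major copy places the same values next to each other.

```python
import numpy as np

def feature_major(samples):
    """Return a copy in which each feature's values are contiguous in memory.

    `samples` has shape (n_samples, n_features) in row-major (C) order.
    """
    return np.ascontiguousarray(samples.T)

# Four samples with three float64 features each, stored sample by sample.
samples = np.arange(12, dtype=np.float64).reshape(4, 3)

column_view = samples[:, 0]      # strided view: consecutive values are 24 bytes apart
features = feature_major(samples)
feature_row = features[0]        # contiguous: consecutive values are 8 bytes apart

if __name__ == "__main__":
    print(column_view.strides, column_view.flags["C_CONTIGUOUS"])
    print(feature_row.strides, feature_row.flags["C_CONTIGUOUS"])
```

Both views hold the same numbers, but only the contiguous one lets the processor stream through the data with unit stride, which is generally friendlier to caches and vector units.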
Intel encourages contributions to Intel DAAL to improve both performance and functionality. Readers can fork the code at https://github.com/01org/daal.
The Intel Deep Learning SDK
Sankaranarayanan also discussed the high-level Intel Deep Learning SDK for training and inference. The idea is to simplify installation, provide a visual way to set up, tune, and run deep learning algorithms, and provide end-to-end workflow support for HPC, cloud, and data center users.
This same philosophy is followed in the optimized Intel Python distribution for IA. Information about the Intel high performance Python distribution can be found on a plethora of sites including Intel.com. The Intel distribution can be freely downloaded at https://software.intel.com/en-us/intel-distribution-for-python.
Early Knights Mill information
It was also exciting to see disclosures regarding the potential of the new Intel Xeon Phi processor family, codenamed Knights Mill, plus the common Groveport bootable host CPU platform.
Sankaranarayanan’s presentation shows that Intel is fully engaged with the machine and deep learning communities and that the company is following a robust roadmap to support artificial intelligence customers. The result, as Sankaranarayanan pointed out with benchmarks and the LeCloud case study, is that Intel processors are displacing NVIDIA as the only performant solution. The expectation is that machine learning will turn into a $70B combined hardware and software market by 2020, up from only $8B in 2013 (Source: Intel* references data from IDC, IoT market related to Analytics). This exemplifies the significant growth the industry will experience over the next few years. Similarly, Intel has engaged with the Python numeric and machine learning communities, which are also experiencing significant growth.
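As a quick back-of-the-envelope check on those market figures, the sketch below computes the compound annual growth rate implied by moving from $8B in 2013 to $70B in 2020. It assumes seven annual compounding periods, which is an interpretation of the quoted numbers and not a figure from the presentation. Under that assumption, the implied rate works out to roughly 36 percent per year.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly growth rate that turns start_value into end_value
    after the given number of years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted in the article: $8B in 2013 and $70B expected by 2020.
rate = compound_annual_growth_rate(8.0, 70.0, 2020 - 2013)

if __name__ == "__main__":
    print(f"implied annual growth: {rate:.1%}")
```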
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. He was also the editor of Parallel Programming with OpenACC. Rob can be reached at firstname.lastname@example.org.