
Accelerating Python and Deep Learning

February 20, 2017 by Rob Farber

Sponsored Content

“For deep learning to have a meaningful impact and business value, the time to train a model must be reduced from weeks to hours,” observed Ananth Sankaranarayanan, Intel’s director of engineering, analytics and AI solutions.

Demonstrating the performance benefits of Intel Xeon and Intel Xeon Phi hardware and new Intel Architecture (IA) optimized software at the recent Intel HPC Developer Conference, Sankaranarayanan presented image recognition training results showing that a single Intel Xeon Phi processor 7250 ran an example from Berkeley Vision and Learning Center (BVLC) Caffe, a popular machine learning package, more than 24x faster when calling the new Intel Math Kernel Library Deep Neural Network (Intel MKL-DNN) primitives. He also presented performance data showing a 40x reduction in training time when running on a 128-node Intel Xeon Phi processor cluster connected by the Intel Omni-Path networking fabric.

Similarly, Sankaranarayanan showed that a 2P Intel Xeon processor E5-2699 v4 achieved a greater than 17x speedup when running an image scoring, or inference, workload. Inference performance is critical to the volume processing of deep learning models in the data center. Sankaranarayanan pointed out, “These performance results are generating very strong interest in Intel Xeon and Intel Xeon Phi processors for machine learning and deep learning using Intel-Caffe and displacing NVIDIA as the only performant solution”. He concluded his presentation by showing that near-native performance can be achieved in Python by calling Intel MKL and using the Intel-optimized drop-in replacement Python distribution. These results strongly benefit the numerical Python and Python machine learning communities, as they speed up popular packages such as NumPy, SciPy, Scikit-Learn, PyTables, Scikit-Image, and more.
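
As a concrete illustration of how that near-native performance is reached, dense linear algebra written in ordinary NumPy is dispatched to whatever BLAS library NumPy was built against – MKL, in the case of the Intel distribution. The following minimal sketch (the matrix sizes are illustrative, not from the talk) times one such operation:

```python
import time
import numpy as np

# Dense matrix multiply: NumPy hands this off to the BLAS it was built
# against -- MKL under the Intel Python distribution. Sizes are illustrative.
n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.time()
c = a @ b  # a single DGEMM call, multithreaded in an MKL-backed build
elapsed = time.time() - start

print("{}x{} matmul: {:.2f} s, {:.1f} GFLOP/s".format(
    n, n, elapsed, 2.0 * n**3 / elapsed / 1e9))
```

Running the same script under a stock BLAS and under the Intel distribution makes the difference between the two builds directly visible, with no changes to the application code.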

The Intel deep learning roadmap

The Intel strategy is to make machine learning more pervasive by enabling deployment-ready solutions through a tiered approach supporting a large, open ecosystem as shown below.

Figure 1: Intel machine learning roadmap

There are five main focus tiers:

  • Best in class hardware: The foundation of the roadmap is Intel’s processor technology, which supports deep learning training through the use of Intel Xeon Phi and Intel Xeon processors in HPC and commercial data centers, the cloud, and workstations. Figure 1 shows that FPGAs and the Intel Core processor family currently occupy a middle ground, but we note that Intel did demonstrate deep learning running on an FPGA in its booth at Supercomputing 2016. Thus, developing technologies such as the Nervana ASIC and FPGA-based deep learning may move specialized hardware further into the data center in the near future. Meanwhile, other Intel technologies – encapsulated in the Intel Scalable System Framework (Intel SSF) – augment Intel processors and software with network technologies (such as the Intel Omni-Path architecture), memory technologies (such as fast MCDRAM), and storage technologies such as Intel Solutions for Lustre software and 3D XPoint (a non-volatile memory that is blurring the line between memory and storage).
  • Free Libraries and Languages: Sankaranarayanan focused on the Intel Math Kernel Library (Intel MKL) and the Intel Data Analytics Acceleration Library (Intel DAAL) in his Intel HPC DevCon talk. Intel MKL is well known as a C/C++- and Fortran-callable library, which Intel has recently augmented with a number of Deep Neural Network building blocks. The optimized Intel Python distribution makes use of both of these libraries, as well as the Intel Threading Building Blocks (Intel TBB) library, to run Python applications faster and with more parallelism.
  • Optimized Open Frameworks: Most machine learning efforts use open source machine learning toolkits. These frameworks provide cost savings because they let data scientists and other employees focus on their work rather than on programming. Intel is working with the open source developers and pushing improvements upstream for better multi-node scaling and increased performance for everyone in the Python community. Previously published results from customers such as Kyoto University show substantial performance improvements, to the point that Intel Xeon processors are able to outperform K40 (Kepler generation) GPUs (source: Kyoto University). Sankaranarayanan’s slides also show that an Intel Xeon processor can deliver performance competitive with GPUs.

Figure 2: Comparative performance of Intel Xeon vs GPUs (Source Intel*)

  • Tools and Platforms: Collaboration and security throughout the solution stack are important for any machine learning effort. In his talk, Sankaranarayanan discussed the Trusted Analytics Platform (TAP), which provides an integrated environment of tools, components, and services to optimize performance and security.
  • Solution blueprints: These blueprints provide reference solutions across a wide variety of industries, from self-driving cars to the cloud and financial industries. The idea is to save time and accelerate customer efforts by leveraging the work of experts. In other words, don’t get caught “reinventing the wheel”.

Intel MKL Benchmarks

The Intel MKL-DNN primitives that were recently added to the Intel MKL library have been released under a liberal license so that Intel MKL and Intel MKL-DNN can be used everywhere. The deep learning primitives speed up popular open source deep learning frameworks, as shown below. Optimized deep learning primitives include convolution, pooling, normalization, ReLU, and inner product.
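
To make the role of these primitives concrete, the sketch below gives naive NumPy reference versions of two of them, ReLU and inner product. This is not the Intel MKL-DNN API itself, only an illustration of the computations the library provides as vectorized, cache-blocked, multithreaded kernels; all shapes are illustrative:

```python
import numpy as np

def relu(x):
    """ReLU activation: max(0, x) applied elementwise."""
    return np.maximum(x, 0.0)

def inner_product(x, weights, bias):
    """Fully connected layer: one matrix multiply plus a bias add."""
    return x @ weights + bias

# Illustrative shapes for one fully connected layer with ReLU.
batch, in_features, out_features = 32, 1024, 256
x = np.random.randn(batch, in_features).astype(np.float32)
w = np.random.randn(in_features, out_features).astype(np.float32)
b = np.zeros(out_features, dtype=np.float32)

y = relu(inner_product(x, w, b))
print(y.shape)  # (32, 256)
```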

Figure 3: Intel MKL overview (Source: Intel)

Sankaranarayanan showed the following benchmark results to illustrate the Intel MKL and Intel MKL-DNN benefits for training on both Intel Xeon and Intel Xeon Phi platforms – up to 24x.

Figure 4: Improved Deep Neural Network training performance using Intel MKL (Source: Intel*)

These primitives also benefit inference performance on both processor families – up to 31x.

Figure 5: Improved Deep Neural Network inference performance using Intel MKL (Source: Intel*)

Scaling is also a key performance characteristic as most deep learning training workloads are too big for a single computational node to deliver sufficiently fast time-to-model solutions.
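
The dominant pattern behind this kind of multi-node scaling is data-parallel training: each node computes gradients on its own shard of the minibatch, and the gradients are then averaged across the cluster. The mpi4py sketch below shows that averaging step; it is an illustrative sketch of the technique, not Intel Caffe's internal code:

```python
import numpy as np
from mpi4py import MPI

# Data-parallel training skeleton: every rank computes a local gradient on
# its shard of the minibatch, then an allreduce sums gradients across the
# cluster. This communication step is what a fabric like Intel Omni-Path
# accelerates. (Illustrative sketch; the gradient here is a stand-in.)
comm = MPI.COMM_WORLD

local_grad = np.random.randn(1_000_000).astype(np.float32)
global_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= comm.Get_size()  # average over all nodes

if comm.Get_rank() == 0:
    print("averaged gradient ready on", comm.Get_size(), "ranks")
```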

Figure 6: Multi-Node scaling using Intel Omni-Path on an AlexNet benchmark (Source: Intel*)

Of course, the proof of the pudding is in the eating, which is why Sankaranarayanan discussed a LeCloud deep learning case study. LeCloud is a leading video cloud provider in China that offers a video detection service. Switching from the open source BVLC Caffe with OpenBLAS to the Intel-optimized Caffe plus Intel MKL delivered a 30x performance improvement in production training jobs.
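
Part of what makes such a switch attractive is that BVLC Caffe and the Intel-optimized fork expose the same interfaces, so application code does not change. As a minimal sketch, CPU inference through Caffe's standard Python API looks like the following (the model file names are placeholders):

```python
import numpy as np
import caffe

# Both BVLC Caffe and the Intel-optimized fork expose this same Python API,
# so switching builds requires no application changes. The model files
# below are placeholders.
caffe.set_mode_cpu()  # MKL/MKL-DNN acceleration applies to the CPU path

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Fill the network's input blob with one dummy image batch.
input_name = net.inputs[0]
net.blobs[input_name].data[...] = np.random.rand(
    *net.blobs[input_name].data.shape)

output = net.forward()
print({name: blob.shape for name, blob in output.items()})
```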

Intel DAAL

The Intel Data Analytics Acceleration Library (Intel DAAL) offers fast, ready-to-use, higher-level algorithms to speed data analysis and machine learning. The library can be called from any big-data framework and supports communication schemes such as Hadoop and MPI. A key feature of Intel DAAL is that it helps users manage their data so that it has the optimal layout in processor memory, which lets the processor deliver significantly higher performance. The Intel DAAL data management is part of the full end-to-end workflow support provided for machine learning workloads.
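
Intel DAAL can be called directly from C++, Java, or Python, but many Python users encounter it indirectly: the Intel Python distribution accelerates standard scikit-learn calls with DAAL- and MKL-backed kernels. The minimal sketch below uses only stock scikit-learn calls on synthetic data, so it requires no source changes to benefit from the accelerated build:

```python
import numpy as np
from sklearn.cluster import KMeans

# Standard scikit-learn call; under the Intel Python distribution the same
# code runs on DAAL/MKL-accelerated kernels. Synthetic, illustrative data:
# three well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10_000, 16) + c for c in (0.0, 5.0, 10.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
```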

Figure 7: Intel DAAL components

Intel encourages contributions to Intel DAAL to improve both performance and functionality. Readers can fork the code at https://github.com/01org/daal.

The Intel Deep Learning SDK

Sankaranarayanan also discussed the high-level Intel Deep Learning SDK for training and inference. The idea is to simplify installation, provide a visual way to set up, tune, and run deep learning algorithms, and provide end-to-end workflow support for HPC, cloud, and data center users.

Figure 8: Intel Deep Learning SDK overview (Source: Intel*)

This same philosophy is followed in the Intel Python distribution optimized for IA. Information about the Intel high-performance Python distribution can be found on many sites, including Intel.com. The distribution can be freely downloaded at https://software.intel.com/en-us/intel-distribution-for-python.
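
A quick way to confirm that a given Python installation is actually using an MKL-backed NumPy is to print NumPy's build configuration:

```python
import numpy as np

# Lists the BLAS/LAPACK libraries NumPy was built against; an MKL-backed
# build (such as the Intel distribution's) shows mkl-related entries.
np.__config__.show()
```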

Early Knights Mill information

It was also exciting to see disclosures regarding the potential of the new Intel Xeon Phi processor family, codenamed Knights Mill, plus the common Groveport bootable host CPU platform.

Figure 9: Early Knights Mill next generation Intel Xeon Phi processor (Source: Intel*)

Summary

Sankaranarayanan’s presentation shows that Intel is fully engaged with the machine learning and deep learning communities and is following a robust roadmap to support artificial intelligence customers. The result, as Sankaranarayanan pointed out with benchmarks and the LeCloud case study, is that Intel processors are displacing NVIDIA as the only performant solution. The expectation is that machine learning will grow into a $70B combined hardware and software market by 2020, up from only $8B in 2013 (Source: Intel* references data from IDC, IoT market related to Analytics), which exemplifies the significant growth the industry will experience over the next few years. Intel has similarly engaged with the Python numeric and machine learning communities, which are also experiencing significant growth.

Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. He was also the editor of Parallel Programming with OpenACC. Rob can be reached at info@techenablement.com.

*Full legal disclosures as well as benchmark configurations can be found in Sankaranarayanan’s talk, “Accelerating Machine Learning Software on IA” (pdf, video)
