
Up To Orders of Magnitude More Performance with Intel’s Distribution of Python

August 17, 2016 by Rob Farber

Intel has created a freely downloadable, optimized Python distribution that can greatly accelerate Python codes. Benchmarks show that speedups of two orders of magnitude (over 100x) can be achieved by using the Intel Distribution for Python. The Intel® Distribution for Python 2017 Beta program (the product release will be in September) provides free access to this optimized version. Intel Python delivers performance improvements for NumPy/SciPy by linking against performance libraries such as the Intel® Math Kernel Library (Intel® MKL), Intel® MPI Library, Intel® Threading Building Blocks (Intel® TBB) and Intel® Data Analytics Acceleration Library (Intel® DAAL). Intel Python supports Python versions 2.7 and 3.5 running on Windows*, Linux* and Mac* OS.
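A quick way to check which math back end a given Python installation uses is NumPy's build-configuration report; with the Intel distribution the report lists MKL. This is a minimal check, not an Intel-provided tool:

import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build links against.
# With the Intel Distribution for Python the listing typically includes MKL;
# the exact section names vary between NumPy versions.
np.show_config()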

Python is a powerful and popular scripting language that provides fast and fundamental tools for scientific computing through numeric libraries such as NumPy and SciPy. Not just for developers, Intel Python offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs. Intel Python is based on Continuum Analytics' Anaconda distribution, allowing users to install Intel Python into their Anaconda environment as well as to use Intel Python with the Conda packages on Anaconda.org. Similarly, Python packages such as Theano provide significant machine learning and deep learning speedups.


Figure 1: Popularity of coding languages in 2016 (Image courtesy codeeval.com)

Michele Chambers, VP of Products & CMO at Continuum Analytics, said, “Python is the de facto data science language that everyone from elementary to graduate school is using because it’s so easy to get started and powerful enough to drive highly complex analytics.”


With over 3 million users, the Anaconda ecosystem is large and growing. For example, Anaconda powers Python for Microsoft’s Azure ML platform, and Continuum recently partnered with Cloudera on a certified Cloudera parcel, which will help bring optimized Intel Python to the cloud.

Two orders of magnitude speedups

Python, popular in both the scientific and commercial communities, can be interpreted or compiled. Traditionally, high performance is achieved by calling optimized native libraries. Continuum Analytics, for example, notes that customers have experienced up to 100x performance increases by using Intel MKL.

Because the distribution is optimized for individual Intel Architecture (IA) targets, significant speedups can be achieved on both Intel Xeon and Intel Xeon Phi processor platforms.

Pure interpreted Python is slow relative to an implementation that calls a native library (in this case NumPy), while Intel's optimized, multithreaded Python provides additional performance (see Figure 2) across a wide range of algorithms (Figure 3) by leveraging SIMD vectorization and multiple cores.


Figure 2: Speedup on a 96-core (with Hyperthreading ON) Intel Xeon processor E5-4657L v2 2.40 GHz over a pure Python example using the default Python distribution (Results courtesy Intel)[i]


Figure 3: Python performance increase on several numerical algorithms (Image courtesy Intel)[ii]
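As an illustrative sketch of the gap Figure 2 highlights between interpreted loops and native library calls (this is not Intel's benchmark code; the vector length is an arbitrary choice), compare a pure-Python dot product with the equivalent NumPy call, which dispatches to the optimized BLAS the distribution links against:

import time
import numpy as np

n = 2000000
a = [0.5] * n
b = [0.25] * n

t0 = time.time()
s = 0.0
for u, v in zip(a, b):        # interpreted loop: one bytecode dispatch per element
    s += u * v
t_pure = time.time() - t0

xa, xb = np.array(a), np.array(b)
t0 = time.time()
s_np = np.dot(xa, xb)         # one call into the BLAS library NumPy links against
t_native = time.time() - t0

print("pure Python: %.3f s, NumPy: %.4f s, speedup: %.0fx" % (t_pure, t_native, t_pure / t_native))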

Intel Xeon Phi processor performance and benefits

Colfax Research shows up to a 154x speedup over standard Python when running on the latest 2nd generation Intel Xeon Phi processors.


Figure 4: Speedup through the use of MKL and MCDRAM on a 2nd generation Intel Xeon Phi processor (Image courtesy Colfax Research)

The speedup was achieved using default settings without any special tuning. Thus the performance corresponds to what a normal Python user would experience, with one exception: the entire application was placed in the high-speed near-memory (also called MCDRAM or high-bandwidth memory) on the 64-core Intel Xeon Phi processor 7210. The Intel Xeon Phi processor near-memory was used in flat mode, i.e., exposed to the programmer as addressable memory in a separate NUMA node, which simply required placing a numactl command in front of the executable Python script. (MCDRAM as High-Bandwidth Memory in Knights Landing Processors: Developer’s Guide from Colfax provides more information about Intel Xeon Phi processor memory modes.)

$ numactl -m 1 benchmark-script.py

Figure 5: Command to run in high-bandwidth memory on the Intel Xeon Phi Processor

This Python benchmark achieved 1.85 TFlop/s double-precision performance. This high flop rate is 70% of the theoretical peak 64-bit arithmetic performance for the Intel Xeon Phi processor 7210. Colfax notes that during profiling they saw that the vector units of the Intel Xeon Phi processor are fast enough that, for many smaller problems, the MKL computation itself took less time than the Python call into the MKL library.
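Colfax's benchmark script is not reproduced here, but a minimal sketch of how such a double-precision flop rate is typically measured from Python (the matrix size and timing approach are illustrative choices, not Colfax's) looks like this:

import time
import numpy as np

n = 4096
a = np.random.rand(n, n)      # double precision by default
b = np.random.rand(n, n)

np.dot(a, b)                  # warm-up call so thread/library start-up is not timed

t0 = time.time()
c = np.dot(a, b)              # dispatches to the dgemm routine of the linked BLAS
elapsed = time.time() - t0

flops = 2.0 * n ** 3          # multiply-add count for an n x n by n x n matrix product
print("%.3f TFlop/s" % (flops / elapsed / 1e12))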

On the basis of their benchmark results, Colfax Research notes, “the usage of Intel MKL remains crucial for extracting the best performance out of Intel architecture”.

The Intel Xeon Phi near-memory benefits all memory-bound applications. For example, sparse matrix libraries (which can be called from Python) typically cause performance issues because they are memory intensive and tend to hop around in memory, which makes them both latency-bound and bandwidth-bound. “Our sparse linear algebra library greatly benefits from the massive bandwidth of the Intel Xeon Phi processor. Thanks to MKL’s cross-platform compatibility, we were able to port it in a matter of hours,” stated Mauricio Hanzich of the Barcelona Supercomputing Center’s HPC Software Engineering Group. “Xeon Phi’s vector registers along with its massive memory bandwidth are just the perfect combination for finite differences schemes.” Most machine learning algorithms will benefit from the higher bandwidth during training, and sparse matrix operations are useful for data preprocessing.


The performance of traditionally memory-unfriendly algorithms such as LU decomposition, Cholesky decomposition, and SVD (Singular Value Decomposition) is shown in the following three figures. Performance speedups for the Intel-optimized Python range from 7x to 29x (Source: Colfax Research).


Figure 6: LU Decomposition speedup on a 64-core Intel Xeon Phi processor 7210 (Image courtesy Colfax Research)


Figure 7: Cholesky Decomposition speedup on a 64-core Intel Xeon Phi processor 7210 (Image courtesy Colfax Research)


Figure 8: Singular Value Decomposition speedup on a 64-core Intel Xeon Phi processor 7210 (Image courtesy Colfax Research)
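These decompositions are ordinary SciPy calls; the optimized distribution simply routes them to MKL's LAPACK implementations. A minimal sketch (the matrix size is an arbitrary illustration, not the benchmark configuration):

import numpy as np
from scipy import linalg

n = 2048
a = np.random.rand(n, n)
spd = a.dot(a.T) + n * np.eye(n)   # symmetric positive-definite input for Cholesky

p, l, u = linalg.lu(a)             # LU decomposition
c = linalg.cholesky(spd)           # Cholesky decomposition
u2, s, vt = linalg.svd(a)          # Singular Value Decomposition

print(s[:5])                       # the five largest singular values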

Intel Optimized Python for deep and machine learning

PyDAAL provides Python interfaces to the Intel® Data Analytics Acceleration Library (DAAL).

Intel DAAL is an IA-optimized library that provides building blocks for the data analytics stages from data preparation to data mining and machine learning. It is available for free download as a standalone community-licensed version, or as part of Intel Parallel Studio XE. (The Community Licensing for Intel Performance Libraries means the library is free for anyone who registers, with no royalties and no restrictions on company or project size.)

The library is composed of a series of building blocks that can be combined in a performant and scalable fashion across a wide range of Intel processors and be integrated into big data analytic workflows. It can handle computations that are too big to fit into memory using out-of-core algorithms. There are three supported processing modes:

  • Batch processing – Processes all the data at once via in-memory algorithms.
  • Online processing (also called streaming) – Uses out-of-core algorithms that process data in chunks and incrementally update a partial result, which is then used to create the final result (a conceptual sketch of this mode follows the list).
  • Distributed processing – Using a model similar to MapReduce, consumers in a cluster process local data (the map stage), after which a producer process collects and combines the partial results from the consumers (the reduce stage). The communications functions can be written entirely by the developer, which makes it possible to adapt the library to a variety of communications frameworks such as Hadoop or Spark, or to explicitly coded communications using a framework like MPI.
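As a conceptual illustration of the online (streaming) mode only, written in plain NumPy rather than the PyDAAL API (which wraps this pattern in its own algorithm classes), the idea of incrementally updating a partial result chunk by chunk and then finalizing it looks like this:

import numpy as np

def stream_chunks(n_chunks=10, rows=100000, cols=50):
    # Stand-in for data that arrives in pieces too large to hold in memory at once.
    for _ in range(n_chunks):
        yield np.random.rand(rows, cols)

count = 0
total = None                       # partial results, updated one chunk at a time
total_sq = None

for chunk in stream_chunks():
    count += chunk.shape[0]
    if total is None:
        total = chunk.sum(axis=0)
        total_sq = (chunk ** 2).sum(axis=0)
    else:
        total += chunk.sum(axis=0)
        total_sq += (chunk ** 2).sum(axis=0)

mean = total / count               # finalize: combine partial results into the end result
variance = total_sq / count - mean ** 2
print(mean[:3], variance[:3])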
Figure 9: General data modeling with Intel DAAL

Figure 10: Data transformation and Analysis in Intel DAAL (Image courtesy Intel)

Theano is a Python library that lets researchers transparently run deep learning models on CPUs and GPUs. It does so by generating C++ code from the Python script for both CPU and GPU architectures. Intel has optimized the popular Theano deep learning framework, which is used to perform computational drug discovery and machine-learning based big data analytics.
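For readers unfamiliar with that workflow, a generic Theano usage sketch (standard Theano API, not specific to Intel's optimized build) shows where the compile step happens:

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix('x')                   # symbolic double-precision matrix
y = T.nnet.sigmoid(T.dot(x, x.T))    # symbolic expression graph
f = theano.function([x], y)          # Theano generates and compiles native code here

print(f(np.random.rand(4, 4)))       # the compiled kernel runs on the sample input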

The following graph shows the benchmark results for two DBN (Deep Belief Network) neural network configurations (i.e., architectures). The optimized code delivers an 8.78x performance improvement for the larger DBN, which contains 2,000 hidden neurons, over the original open-source implementation. These results also show that a dual-socket Intel® Xeon® E5-2699 v3 (Haswell architecture) system delivers a 1.72x performance improvement over an NVIDIA K40 GPU using 16-bit arithmetic (which can double GPU memory bandwidth).

Figure 11: Original vs optimized performance relative to a GPU. The middle bar is the optimized performance (Higher is better) (Results courtesy Intel)

Figure 12: Speedup of optimized Theano relative to GPU plus impact of the larger Intel Xeon memory capacity. (Results courtesy Kyoto University)

Optimized threading for Python using Intel TBB

The Intel optimized Python also utilizes Intel TBB to provide a more efficient threading model for multi-core processors.

It’s quite easy to try. Simply pass the -m TBB argument to Python:

$ python -m TBB <your>.py

Figure 13: Python command-line invocation that will use TBB

Intel TBB can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when more software threads are utilized than there are available hardware resources.

The following complete example demonstrates how simple it is to use Intel TBB in Python. Intel says this example runs 50% faster, depending on the Intel Xeon processor configuration and number of processors.

import dask, time
import dask.array as da

x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()

q, r = da.linalg.qr(x)                   # each chunk is factored by a multi-threaded MKL call
test = da.all(da.isclose(x, q.dot(r)))   # verify that q.dot(r) reconstructs x
assert(test.compute())                   # compute(get=dask.threaded.get) by default

print(time.time() - t0)

Figure 14: A complete Python thread oversubscription example that uses Intel TBB (Example courtesy Intel)

Dask is a flexible parallel computing library for analytics. This example splits an array into chunks and performs a QR decomposition using multiple threads in parallel. Each Dask task processes a chunk via a multi-threaded Intel MKL call. Thus, this simple example actually implements a complex multi-threaded nested-parallelism QR decomposition that oversubscribes the available hardware resources.

Optimized MPI for Python using Intel MPI

MPI (Message Passing Interface) is the de facto standard distributed communications framework for scientific and commercial parallel distributed computing. The Intel® MPI implementation is a core technology in the Intel® Scalable System Framework that provides programmers a “drop-in” MPICH replacement library that can deliver the performance benefits of the Intel® Omni-Path Architecture (Intel® OPA) communications fabric plus high-core-count Intel® Xeon and Intel® Xeon Phi™ processors. “Drop-in” literally means that programmers can set an environment variable to dynamically load the highly tuned and optimized Intel MPI library – no recompilation required!

The Intel MPI team has spent a significant amount of time tuning the Intel MPI library to different processor families plus network types and topologies. For example, shared memory is particularly important on high-core-count processors, as data can be shared between cores without the need for a copy operation. DMA-mapped memory and RDMA (Remote Direct Memory Access) operations are also utilized to prevent excess data movement. Optimized memory copy operations, selected according to the processor microarchitecture (uArch), are used only when required. Special support is also provided for the latest 2nd generation Intel Xeon Phi processor (codename Knights Landing).
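Python codes usually reach MPI through the mpi4py package, which the Intel distribution can build against Intel MPI (whether a given installation actually links Intel MPI depends on how it was built). A minimal sketch, launched with a command such as mpirun -n 4 python hello_mpi.py (the file name is just an example):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                     # this process's ID within the communicator
size = comm.Get_size()                     # total number of MPI processes

total = comm.allreduce(rank, op=MPI.SUM)   # sum each rank's value across all processes
print("rank %d of %d, allreduce total = %d" % (rank, size, total))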

Optimized random number generation

The latest version of the Intel® Distribution for Python* 2017 Beta introduces numpy.random_intel, an extension to NumPy that closely mirrors the design of numpy.random and uses Intel MKL’s vector statistics library. Simply replacing numpy.random with numpy.random_intel can deliver a significant boost (up to 57.44x) in random number generation performance (Source: Intel Corporation). Random number generation can be a bottleneck in Monte Carlo and other statistical methods.
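A minimal before/after sketch of that substitution, assuming numpy.random_intel mirrors numpy.random's sampling functions as Intel describes (the guarded import falls back to the stock generator where the Intel module is absent):

import time
import numpy as np
try:
    import numpy.random_intel as rnd   # MKL-backed generators in the Intel distribution
except ImportError:
    rnd = np.random                    # stock generators elsewhere

t0 = time.time()
for _ in range(256):                   # 256 repetitions of 100,000 variates, as in Figure 15
    rnd.poisson(7.6, 100000)
print("poisson(7.6) sampling: %.3f s" % (time.time() - t0))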

Total timing of sampling of 100,000 variates repeated 256 times

Distribution                   timing(random), s   timing(random_intel), s   speedup factor
uniform(-1, 1)                  0.357               0.034                    10.52
normal(0, 1)                    0.834               0.081                    10.35
gamma(5.2, 1)                   1.399               0.267                     5.25
beta(0.7, 2.5)                  3.677               0.556                     6.61
randint(0, 100)                 0.228               0.053                     4.33
poisson(7.6)                    2.990               0.052                    57.44
hypergeometric(214, 97, 83)    11.353               0.517                    21.96

Figure 15: Random number performance speedups (Results courtesy Intel)

Profiling

Intel VTune also supports profiling Python programs. Figure 16 shows an example screenshot of the profile information available for a Python application that uses OpenMP to distribute a matrix multiplication across cores while MKL multithreading parallelizes and vectorizes the computation.


Figure 16: Intel VTune screenshot

Summary

The optimized Intel Python distribution is free to try, and current benchmark results indicate it is well worth the effort. See for yourself on your own Python applications by downloading it from https://software.intel.com/en-us/python-distribution.

 


[i] Configuration Info: – Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L v2@2.40GHz, RAM 64GB, Operating System: Fedora release 23 (Twenty Three)

[ii] Configuration Info: Versions Intel® Distribution for Python 2.7.11.2017 Beta (Mar 08, 2016), Ubuntu® built Python 2.7.11, NumPy 1.10.4, SciPy 0.17.0 built with gcc 4.8.4; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.3 GHz (2 sockets, 16 cores each, HT=OFF), 64 GB RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS; MKL version 11.3.2 for Intel Distribution for Python 2017, Beta

Benchmark disclosure: http://www.intel.com/content/www/us/en/benchmarks/benchmark.html

Optimization Notice: https://software.intel.com/en-us/node/528854

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   * Other brands and names are the property of their respective owners.   Benchmark Source: Intel Corporation

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice revision #20110804 .

 
