Intel has created a freely downloadable, optimized Python distribution that can greatly accelerate Python codes. Benchmarks show that speedups of two orders of magnitude (over 100x) can be achieved by using the Intel Distribution for Python. The Intel® Distribution for Python 2017 Beta program (product release will be in September) provides free access to this optimized version. Intel Python delivers performance improvements for NumPy/SciPy by linking with performance libraries such as Intel® Math Kernel Library (Intel® MKL), Intel® Message Passing Interface (Intel® MPI), Intel® Threading Building Blocks (Intel® TBB) and Intel® Data Analytics Acceleration Library (Intel® DAAL). Intel Python supports Python versions 2.7 and 3.5 running on Windows*, Linux* and Mac* OS.
Python is a powerful and popular scripting language that provides fast and fundamental tools for scientific computing through numeric libraries such as NumPy and SciPy. Not just for developers, Intel Python offers support for advanced analytics, numerical computing, just-in-time compilation, profiling, parallelism, interactive visualization, collaboration and other analytic needs. Intel Python is based on the Continuum Analytics Anaconda distribution, so users can install Intel Python into their Anaconda environment and use Intel Python with the Conda packages on Anaconda.org. Similarly, Python packages such as Theano provide significant machine and deep learning speedups.
Michele Chamber, VP of Products & CMO at Continuum Analytics, said, “Python is the de facto data science language that everyone from elementary to graduate school is using because it’s so easy to get started and powerful enough to drive highly complex analytics.”
With over 3 million users, the Anaconda ecosystem is large and growing. For example, Anaconda powers Python for Microsoft’s Azure ML platform, and Continuum recently partnered with Cloudera on a certified Cloudera parcel, which will help bring optimized Intel Python to the cloud.
Two orders of magnitude speedups
Popular in the scientific and commercial communities, Python applications can be interpreted or compiled. Traditionally, high performance can be achieved by calling optimized native libraries. Continuum Analytics, for example, notes that customers have experienced up to 100x performance increases by using Intel MKL.
Because the distribution is optimized for individual Intel Architectures (IA), significant speedups can be achieved on both Intel Xeon and Intel Xeon Phi processor platforms.
Pure interpreted Python is slow relative to an implementation that calls a native library, in this case NumPy, while Intel optimized multithreaded Python provides additional performance (see figure 2) for a wide range of algorithms (figure 3) by leveraging SIMD and multicore.
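The gap between an interpreted loop and a native library call can be illustrated with a minimal sketch such as the one below. It is not one of the benchmarks cited in this article: the function names and the array size are assumptions chosen for illustration, and the measured ratio will vary with the machine and with the NumPy build in use.

```python
import time

import numpy as np


def dot_pure_python(a, b):
    """Dot product computed with an interpreted Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_numpy(a, b):
    """Dot product delegated to NumPy's native implementation."""
    return float(np.dot(a, b))


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    n = 1000000  # illustrative size, chosen arbitrarily
    a = np.random.rand(n)
    b = np.random.rand(n)
    # The pure-Python version gets plain lists so that only the loop is timed.
    _, t_loop = time_call(dot_pure_python, a.tolist(), b.tolist())
    _, t_native = time_call(dot_numpy, a, b)
    print("pure Python: {:.4f} s, NumPy: {:.4f} s".format(t_loop, t_native))
```

With an MKL-linked NumPy, the native call can additionally use SIMD instructions and multiple cores, which is the further gain the text attributes to the Intel optimized distribution.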
Intel Xeon Phi processor performance and benefits
Colfax Research shows up to a 154x speedup over standard Python when running on the latest 2nd generation Intel Xeon Phi processors.
The speedup was achieved using default settings without any special tuning. Thus the performance corresponds to what a normal Python user would experience, with one exception: the entire application was placed in the high-speed near-memory (also called MCDRAM or high-bandwidth memory) on the 64-core Intel Xeon Phi processor 7210. The Intel Xeon Phi processor near-memory was used in flat mode, i.e., exposed to the programmer as addressable memory in a separate NUMA node, which simply required placing a numactl command in front of the executable Python script. (MCDRAM as High-Bandwidth Memory in Knights Landing Processors: Developer’s Guide from Colfax provides more information about Intel Xeon Phi processor memory modes.)
$ numactl -m 1 benchmark-script.py
Figure 5: Command to run in high-bandwidth memory on the Intel Xeon Phi Processor
This Python benchmark achieved 1.85 TFlop/s double-precision performance, which is 70% of the theoretical peak 64-bit arithmetic performance of the Intel Xeon Phi processor 7210. Colfax notes that during profiling they saw that the vector units of the Intel Xeon Phi processor are fast enough that, for many smaller problems, Intel MKL spent less time performing the computation than Python spent calling the MKL library.
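The 70% figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes a 1.3 GHz base clock and two 512-bit vector units per core with fused multiply-add; these hardware parameters are assumptions not stated in this article, and only the 64 cores and the 1.85 TFlop/s measurement come from the text.

```python
CORES = 64                 # stated in the text for the 7210
BASE_CLOCK_GHZ = 1.3       # assumption: nominal base frequency
VECTOR_UNITS_PER_CORE = 2  # assumption: two 512-bit vector units per core
DOUBLES_PER_VECTOR = 8     # 512 bits / 64-bit doubles
FLOPS_PER_FMA = 2          # a fused multiply-add counts as two operations


def peak_tflops(cores, clock_ghz, vector_units, lanes, flops_per_op):
    """Theoretical double-precision peak in TFLOP/s."""
    return cores * clock_ghz * vector_units * lanes * flops_per_op / 1000.0


peak = peak_tflops(CORES, BASE_CLOCK_GHZ, VECTOR_UNITS_PER_CORE,
                   DOUBLES_PER_VECTOR, FLOPS_PER_FMA)
measured = 1.85  # TFLOP/s, the Colfax result quoted above
efficiency = measured / peak

print("peak = {:.2f} TFLOP/s, efficiency = {:.1%}".format(peak, efficiency))
```

Under these assumptions the peak is about 2.66 TFLOP/s, so 1.85 TFlop/s corresponds to roughly 70% of peak, consistent with the figure Colfax reports.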
On the basis of their benchmark results, Colfax Research notes, “the usage of Intel MKL remains crucial for extracting the best performance out of Intel architecture”.
The Intel Xeon Phi near-memory benefits all memory-bound applications. For example, sparse matrix libraries (which can be called from Python) typically cause performance issues because they are memory intensive and tend to hop around in memory, which makes them bound by both access latency and memory bandwidth. “Our sparse linear algebra library greatly benefits from the massive bandwidth of the Intel Xeon Phi processor. Thanks to MKL’s cross-platform compatibility, we were able to port it in a matter of hours,” stated Mauricio Hanzich of the Barcelona Supercomputing Center’s HPC Software Engineering Group. “Xeon Phi’s vector registers along with its massive memory bandwidth are just the perfect combination for finite differences schemes.” Most machine learning algorithms will benefit from the higher bandwidth during training, and sparse matrix operations are useful for data preprocessing.
The performance on traditionally memory-unfriendly algorithms such as LU decomposition, Cholesky decomposition, and SVD (Singular Value Decomposition) is shown in the following three figures. Performance speedups for the Intel optimized Python range from 7x to 29x. (Source: Colfax Research.)
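For readers who want to time these operations on their own system, the following sketch shows how the three decompositions are typically reached from NumPy. It is not the Colfax benchmark code: the matrix size is an arbitrary assumption, and LU is exercised indirectly through np.linalg.solve, which factorizes the matrix with a LAPACK LU routine, because NumPy itself exposes no standalone LU function (SciPy provides one as scipy.linalg.lu).

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200  # illustrative size, far smaller than a real benchmark

# A well-conditioned square matrix and a symmetric positive definite one.
a = rng.rand(n, n) + n * np.eye(n)
spd = a.dot(a.T)
b = rng.rand(n)

# LU decomposition, used internally by the linear solver.
x = np.linalg.solve(a, b)

# Cholesky decomposition: spd = l * l^T with l lower triangular.
l = np.linalg.cholesky(spd)

# Singular value decomposition: a = u * diag(s) * vt.
u, s, vt = np.linalg.svd(a, full_matrices=False)

# Verify each factorization by reconstructing the input.
lu_ok = np.allclose(a.dot(x), b)
cholesky_ok = np.allclose(l.dot(l.T), spd)
svd_ok = np.allclose((u * s).dot(vt), a)
print(lu_ok, cholesky_ok, svd_ok)
```

Because the calls go through NumPy to the underlying LAPACK library, the same script runs unchanged whether NumPy is linked against Intel MKL or another implementation; only the timings differ.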
Intel Optimized Python for deep and machine learning
PyDAAL provides Python interfaces to the Intel® Data Analytics Acceleration Library (DAAL).
Intel DAAL is an IA-optimized library that provides building blocks for the data analytics stages from data preparation to data mining and machine learning. It is available as a free standalone download under a community license, or as part of Intel Parallel Studio XE. (The Community Licensing for Intel Performance Libraries license means the library is free for anyone who registers, with no royalties, and no restrictions on company or project size.)
The library is composed of a series of building blocks that can be combined in a performant and scalable fashion across a wide range of Intel processors and be integrated into big data analytic workflows. It can handle computations that are too big to fit into memory using out-of-core algorithms. There are three supported processing modes:
- Batch processing – Processes all the data at once via in-memory algorithms.
- Online processing (also called streaming) – Utilizes out-of-core algorithms that process data in chunks and incrementally update a partial result, which is then used to create the final result.
- Distributed processing – Using a model similar to MapReduce, consumers in a cluster process local data (map stage), after which a producer process collects and combines partial results from the consumers (reduce stage). The communications functions can be written entirely by the developer, which makes it possible to adapt the library to a variety of communications frameworks such as Hadoop or Spark, or to use explicitly coded communications with a framework like MPI.
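The three processing modes can be sketched in plain Python with a simple statistic. The example below does not use the PyDAAL API; the function names are hypothetical and the statistic (mean and population variance) was chosen only because its partial results are easy to combine.

```python
from functools import reduce


def partial_result(chunk):
    """Summarize one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def combine(p, q):
    """Merge two partial results into one."""
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2])


def finalize(p):
    """Turn a partial result into (mean, population variance)."""
    count, total, total_sq = p
    mean = total / count
    return mean, total_sq / count - mean * mean


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
chunks = [data[0:3], data[3:6], data[6:8]]

# Batch: all of the data is processed at once.
batch = finalize(partial_result(data))

# Online: chunks arrive one at a time and update a running partial result.
running = (0, 0.0, 0.0)
for chunk in chunks:
    running = combine(running, partial_result(chunk))
online = finalize(running)

# Distributed: each consumer summarizes its local chunk (map stage),
# then a producer combines the partial results (reduce stage).
distributed = finalize(reduce(combine, map(partial_result, chunks)))

print(batch, online, distributed)
```

All three modes produce the same answer because the partial results are associative under `combine`; that property is what allows an algorithm to be run out of core or across a cluster.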
Theano is a Python library that lets researchers transparently run deep learning models on CPUs and GPUs. It does so by generating C++ code from the Python script for both CPU and GPU architectures. Intel has optimized the popular Theano deep learning framework, which is used to perform computational drug discovery and machine-learning based big data analytics.
The following graph shows the benchmark results for two DBN (Deep Belief Network) neural network configurations (i.e., architectures). The optimized code delivers an 8.78x performance improvement for the larger DBN containing 2,000 hidden neurons over the original open source implementation. These results also show that a dual-socket Intel® Xeon E5-2699v3 (Haswell architecture) system delivers a 1.72x performance improvement over an NVIDIA K40 GPU using 16-bit arithmetic (which can double GPU memory bandwidth).
Optimized threading for Python using Intel TBB
The Intel optimized Python also utilizes Intel TBB to provide a more efficient threading model for multi-core processors.
It’s quite easy to try. Simply pass the -m TBB argument to Python:
$ python -m TBB <your>.py
Figure 13: Python command-line invocation that will use TBB
Intel TBB can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when a program uses more software threads than there are hardware resources available.
The following complete example demonstrates how simple it is to use Intel TBB in Python. Intel says this example runs 50% faster, depending on the Intel Xeon processor configuration and the number of processors.
import dask, time
import dask.array as da
x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()
q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert(test.compute()) # compute(get=dask.threaded.get) by default
print(time.time() - t0)
Figure 14: A complete Python thread oversubscription example that uses Intel TBB (Example courtesy Intel)
Dask is a flexible parallel computing library for analytics. This example splits an array into chunks and performs a QR decomposition using multiple threads in parallel. Each Dask task processes a chunk via a multi-threaded Intel MKL call. Thus, this simple example actually implements a complex multi-threaded nested-parallelism QR decomposition that oversubscribes the available hardware resources.
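The degree of oversubscription in such nested parallelism is simple arithmetic: outer workers multiplied by inner threads per worker, compared with the hardware threads available. The sketch below uses hypothetical function names and assumes that both the outer task pool and the inner math library default to one thread per hardware thread; actual defaults depend on the library versions and environment settings in use.

```python
import os


def software_threads(outer_workers, inner_threads_per_worker):
    """Total threads when every outer worker spawns its own inner thread team."""
    return outer_workers * inner_threads_per_worker


def oversubscription_factor(outer_workers, inner_threads_per_worker,
                            hardware_threads):
    """Ratio of software threads to hardware threads (1.0 is fully subscribed)."""
    total = software_threads(outer_workers, inner_threads_per_worker)
    return total / float(hardware_threads)


if __name__ == "__main__":
    hw = os.cpu_count() or 1
    # Assumption: outer pool and inner library each use one thread per
    # hardware thread, so the product grows with the square of the core count.
    factor = oversubscription_factor(hw, hw, hw)
    print("{} hardware threads, {} software threads, {:.0f}x oversubscribed"
          .format(hw, software_threads(hw, hw), factor))
```

On a 16-thread machine this worst case yields 256 software threads competing for 16 hardware threads, which is the kind of situation a shared scheduler is meant to avoid.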
Optimized MPI for Python using Intel MPI
MPI (Message Passing Interface) is the de facto standard distributed communications framework for scientific and commercial parallel distributed computing. The Intel® MPI implementation is a core technology in Intel® Scalable System Framework that provides programmers a “drop-in” MPICH replacement library that can deliver the performance benefits of the Intel® Omni-Path Architecture (Intel® OPA) communications fabric plus high core count Intel® Xeon and Intel® Xeon Phi™ processors. “Drop-in” literally means that programmers can set an environment variable to dynamically load the highly tuned and optimized Intel MPI library, with no recompilation required.
The Intel MPI team has spent a significant amount of time tuning the Intel MPI library to different processor families plus network types and topologies. For example, shared memory is particularly important on high core count processors, as data can be shared between cores without the need for a copy operation. DMA mapped memory and RDMA (Remote Direct Memory Access) operations are also utilized to prevent excess data movement. Optimized memory copy operations, selected according to the processor microarchitecture, are used only when required. Special support is also provided for the latest 2nd generation Intel Xeon Phi processor (codename Knights Landing).
Optimized random number generation
The latest version of the Intel® Distribution for Python* 2017 Beta introduces numpy.random_intel, an extension to NumPy that closely mirrors the design of numpy.random and uses Intel MKL’s vector statistics library for random number generation. Simply replacing numpy.random with numpy.random_intel can speed up random number generation by up to 57.44x (Source: Intel Corporation). This matters because random number generation can be a bottleneck in Monte Carlo and other statistical methods.
Total timing of sampling of 100,000 variates repeated 256 times

| Distribution | timing(random) in secs | timing(random_intel) in secs | speedup factor |
|---|---|---|---|
| hypergeometric(214, 97, 83) | 11.353 | 0.517 | 21.96 |
Figure 15: Random number performance speedups (Results courtesy Intel)
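A measurement of this kind can be reproduced with a sketch along the following lines. It assumes, as the text states, that numpy.random_intel mirrors numpy.random closely enough for the same call to work with either module, and it falls back to stock NumPy when the Intel module is not installed. The helper name and the reduced repeat count in the usage section are assumptions made to keep the run short.

```python
import time

try:
    # Present only in the Intel Distribution for Python (per the text above).
    import numpy.random_intel as rnd
    backend = "numpy.random_intel"
except ImportError:
    # Fall back to stock NumPy so the sketch runs anywhere.
    import numpy.random as rnd
    backend = "numpy.random"


def time_hypergeometric(repeats=256, size=100000):
    """Draw `size` hypergeometric(214, 97, 83) variates `repeats` times.

    Returns the last sample and the total elapsed time in seconds.
    """
    sample = None
    start = time.perf_counter()
    for _ in range(repeats):
        sample = rnd.hypergeometric(214, 97, 83, size=size)
    return sample, time.perf_counter() - start


if __name__ == "__main__":
    # The table above used 256 repeats; fewer are used here for a quick check.
    _, elapsed = time_hypergeometric(repeats=16)
    print("{}: {:.3f} s for 16 repeats".format(backend, elapsed))
```

Running the script once under each distribution gives a like-for-like comparison on the reader's own hardware, which may differ considerably from the figures in the table.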
Intel VTune also supports profiling Python programs. Figure 16 shows an example screenshot of the profile information available for a Python application that uses OpenMP to distribute a matrix multiplication across nodes and that also uses MKL multi-threading to parallelize and vectorize the computation within a node.
The optimized Intel Python distribution is available to try. Current benchmark results indicate it is well worth trying. See for yourself on your Python applications by downloading it from https://software.intel.com/en-us/python-distribution.
[i] Configuration Info: – Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L email@example.comGHz, RAM 64GB, Operating System: Fedora release 23 (Twenty Three)
[ii] Configuration Info: Versions Intel® Distribution for Python 126.96.36.1997 Beta (Mar 08, 2016), Ubuntu® built Python 2.7.11, NumPy 1.10.4, SciPy 0.17.0 built with gcc 4.8.4; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.3 GHz (2 sockets, 16 cores each, HT=OFF), 64 GB RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS; MKL version 11.3.2 for Intel Distribution for Python 2017, Beta
Benchmark disclosure: http://www.intel.com/content/www/us/en/benchmarks/benchmark.html
Optimization Notice: https://software.intel.com/en-us/node/528854
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .