
Webinars Showing How to GPU Accelerate Python With Numba

November 24, 2015 by Rob Farber

Register to attend a webinar on accelerating Python programs with the integrated GPU on AMD Accelerated Processing Units (APUs) using Numba, an open-source just-in-time compiler that generates faster code from pure Python. The webinar will be presented by Stanley Seibert of Continuum Analytics, the creators of the Numba project, and is tailored to an audience with intermediate Python and basic NumPy experience.
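
As a taste of what Numba does even before a GPU enters the picture, here is a minimal sketch (not taken from the webinar; the function name and array size are illustrative) of just-in-time compiling a plain Python loop:

import numpy as np
from numba import njit

@njit  # compile this pure-Python loop to machine code on first call
def sum_of_squares(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total

x = np.random.random(1000000)
sum_of_squares(x)           # first call triggers JIT compilation
print(sum_of_squares(x))    # subsequent calls run the compiled code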

The following is a short Numba example for an AMD GPU:

An AMD GPU example
import numpy as np
from numba import hsa, float32
from timeit import default_timer as timer
 
 
blocksize = 20
gridsize = 20
 
@hsa.jit
def matmulfast(A, B, C):
    x = hsa.get_global_id(0)
    y = hsa.get_global_id(1)
 
    tx = hsa.get_local_id(0)
    ty = hsa.get_local_id(1)
 
    sA = hsa.shared.array(shape=(blocksize, blocksize), dtype=float32)
    sB = hsa.shared.array(shape=(blocksize, blocksize), dtype=float32)
 
    if x >= C.shape[0] or y >= C.shape[1]:
        return
 
    tmp = 0.
 
    for i in range(gridsize):
        # preload
        sA[tx, ty] = A[x, ty + i * blocksize]
        sB[tx, ty] = B[tx + i * blocksize, y]
        # wait for preload to end
        hsa.barrier(1)
        # compute loop
        for j in range(blocksize):
            tmp += sA[tx, j] * sB[j, ty]
        # wait for compute to end
        hsa.barrier(1)
 
    C[x, y] = tmp
 
N = gridsize * blocksize
A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros_like(A)
 
griddim = gridsize, gridsize
blockdim = blocksize, blocksize
 
with hsa.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("1st GPU time:", te - ts)
 
with hsa.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("2nd GPU time:", te - ts)
 
ts = timer()
ans = np.dot(A, B)
te = timer()
print("CPU time:", te - ts)
np.testing.assert_allclose(ans, C, rtol=1e-5)
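
In both this example and the CUDA version below, each work-group (CUDA thread block) stages a blocksize × blocksize tile of A and B into fast on-chip local (shared) memory; the barriers guarantee that every thread has finished loading a tile before any thread reads from it, and has finished reading it before the next iteration overwrites it.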

A CUDA example
from numba import cuda, float32
 
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
 
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
 
    x, y = cuda.grid(2)
 
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid
 
    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return
 
    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
 
        # Wait until all threads finish preloading
        cuda.syncthreads()
 
        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
 
        # Wait until all threads finish computing
        cuda.syncthreads()
 
    C[x, y] = tmp
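
The CUDA listing shows only the kernel. Here is a minimal sketch of the host-side driver, analogous to the AMD host code above (the matrix size N is an assumption, and must be a multiple of TPB because the kernel does not bounds-check its tile loads):

import numpy as np
from numba import cuda

N = 16 * TPB  # assumption: chosen so the grid exactly tiles the matrices
A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

d_A = cuda.to_device(A)   # copy inputs to the GPU
d_B = cuda.to_device(B)
d_C = cuda.to_device(C)

blockdim = (TPB, TPB)
griddim = (N // TPB, N // TPB)
fast_matmul[griddim, blockdim](d_A, d_B, d_C)

C = d_C.copy_to_host()    # copy the result back to the host
np.testing.assert_allclose(np.dot(A, B), C, rtol=1e-5)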

About the presenter:
Dr. Stanley Seibert leads the High Performance Analytics team at Continuum Analytics. He received a Ph.D. in experimental high energy physics from the University of Texas at Austin, and performed research at Los Alamos National Laboratory, University of Pennsylvania, and the Sudbury Neutrino Observatory. Stan has been evangelizing the use of Python and GPU computing for research since 2007. Prior to joining Continuum Analytics, Stan was Chief Data Scientist at Mobi.

Continuum Analytics develops Anaconda, the leading modern open source analytics platform powered by Python. Continuum's founders and developers have created or contributed to some of the most popular data science technologies, including NumPy, SciPy, Pandas, Jupyter/IPython, and many others.

Filed Under: Featured tutorial, Tutorials Tagged With: AMD, CUDA, GPU, HPC
