Register to attend a webinar on accelerating Python programs on the integrated GPU of AMD Accelerated Processing Units (APUs) with Numba, an open source just-in-time compiler that generates faster code from pure Python. The webinar will be presented by Stanley Seibert of Continuum Analytics, the creators of the Numba project, and is tailored to an audience with intermediate Python and basic NumPy experience.
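To make the "pure Python" claim concrete, here is a minimal CPU-side sketch (not taken from the webinar; the function and array sizes are illustrative) of how Numba's @jit decorator compiles an ordinary Python function on first call:

import numpy as np
from numba import jit

@jit(nopython=True)
def sum2d(arr):
    # Plain Python loops, compiled to native code by Numba
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

x = np.random.random((1000, 1000))
print(sum2d(x))  # first call triggers JIT compilation; later calls reuse the compiled code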
The following is a short Numba matrix multiplication example for an AMD GPU:
from timeit import default_timer as timer

import numpy as np
from numba import hsa, float32

blocksize = 20
gridsize = 20

@hsa.jit
def matmulfast(A, B, C):
    x = hsa.get_global_id(0)
    y = hsa.get_global_id(1)

    tx = hsa.get_local_id(0)
    ty = hsa.get_local_id(1)

    sA = hsa.shared.array(shape=(blocksize, blocksize), dtype=float32)
    sB = hsa.shared.array(shape=(blocksize, blocksize), dtype=float32)

    if x >= C.shape[0] or y >= C.shape[1]:
        return

    tmp = 0

    for i in range(gridsize):
        # preload
        sA[tx, ty] = A[x, ty + i * blocksize]
        sB[tx, ty] = B[tx + i * blocksize, y]
        # wait for preload to end
        hsa.barrier(1)
        # compute loop
        for j in range(blocksize):
            tmp += sA[tx, j] * sB[j, ty]
        # wait for compute to end
        hsa.barrier(1)

    C[x, y] = tmp

N = gridsize * blocksize
A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros_like(A)

griddim = gridsize, gridsize
blockdim = blocksize, blocksize

with hsa.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("1st GPU time:", te - ts)

with hsa.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("2nd GPU time:", te - ts)

ts = timer()
ans = np.dot(A, B)
te = timer()
print("CPU time:", te - ts)
np.testing.assert_allclose(ans, C, rtol=1e-5)
For comparison, here is the equivalent CUDA kernel:
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
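The CUDA listing above shows only the kernel. For completeness, a minimal host-side driver (not part of the original example; the array size and variable names are illustrative, and it reuses TPB and fast_matmul from the listing) could look like this:

import numpy as np
from numba import cuda

# Hypothetical driver: choose N as a multiple of TPB so every thread
# maps to a valid element and the tiling loop divides evenly.
N = TPB * 16
A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros_like(A)

threadsperblock = (TPB, TPB)
blockspergrid = (N // TPB, N // TPB)

# Copy inputs to the device, launch the kernel, and copy the result back
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.to_device(C)

fast_matmul[blockspergrid, threadsperblock](d_A, d_B, d_C)
d_C.copy_to_host(C)

np.testing.assert_allclose(np.dot(A, B), C, rtol=1e-5)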
About the presenter:
Dr. Stanley Seibert leads the High Performance Analytics team at Continuum Analytics. He received a Ph.D. in experimental high energy physics from the University of Texas at Austin, and performed research at Los Alamos National Laboratory, the University of Pennsylvania, and the Sudbury Neutrino Observatory. Stan has been evangelizing the use of Python and GPU computing for research since 2007. Prior to joining Continuum Analytics, Stan was Chief Data Scientist at Mobi.
Continuum Analytics develops Anaconda, the leading modern open source analytics platform powered by Python. Continuum's founders and developers have created or contributed to some of the most popular data science technologies, including NumPy, SciPy, Pandas, Jupyter/IPython, and many others.