I currently have nine OpenCL tutorials on The Code Project. OpenCL is evolving quickly, and the new 1.2 specification runs on GPU, multicore CPU, and Intel Xeon Phi devices.
To bring more timely information to the online community, I have created study guides on the techEnablement website. Check back often, as these guides will be updated.
Also, note that the techEnablement tutorials will use new web technology (see the Web Dev section) for color-coded source code, HTML5, videos, and more.
Following are my OpenCL tutorials on The Code Project:
- Part 9: OpenCL Extensions and Device Fission: Learn about OpenCL extensions that provide programmers with additional capabilities such as double-precision arithmetic and Device Fission. (Device Fission provides an interface to subdivide a single OpenCL device into multiple devices, each with its own asynchronous command queue; a short code sketch follows this list.)
- Part 8: Heterogeneous workflows using OpenCL: Incorporate OpenCL into heterogeneous workflows via a general-purpose “click together tools” framework that can stream arbitrary messages (vectors, arrays, and complex nested structures) within a single workstation, across a network of machines, or within a cloud computing framework. The ability to create scalable workflows is important because data handling and transformation can be as complex and time-consuming as the computation that generates the desired result.
- Part 7: OpenCL plugins: Demonstrates how to create C/C++ plugins that can be dynamically loaded at runtime to add massively parallel OpenCL capabilities to an already-running application. (A minimal dynamic-loading sketch follows this list.)
- Part 6: Primitive restart and OpenGL interoperability: OpenGL and OpenCL interoperability can greatly accelerate both data generation and data visualization. Basically, the OpenCL application maps the OpenGL buffers so they can be modified by massively parallel kernels running on the GPU. This keeps the data on the GPU and avoids costly PCIe bus transfers. (A buffer-sharing sketch follows this list.)
- Part 5: OpenCL buffers and memory affinity: The example source code from Part 4 was adapted to queue a user-specified number of tasks split among multiple CPU and GPU command queues. The source code in this article continues to use a simple yet useful preprocessor capability to pass C++ template types to an OpenCL kernel.
- Part 4: Coordinating Computations with OpenCL Queues: Discusses the OpenCL™ runtime and demonstrates how to perform concurrent computations among the work queues of heterogeneous devices. (A multi-queue sketch follows this list.)
- Part 3: Work-Groups and Synchronization: Introduces the OpenCL™ execution model and discusses how to coordinate computations among the work items in a work group.
- Part 2: OpenCL Memory Spaces: Implicit in the OpenCL memory model is the idea that the kernel resides in a separate memory space. Each work item can use private memory, local memory, constant memory, and global memory. (A kernel sketch illustrating these memory spaces follows this list.)
- Part 1: OpenCL Portable Parallelism: The big idea behind OpenCL is a portable execution model that allows a kernel to execute at each point in a problem domain.
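For readers who want a feel for what Device Fission (Part 9) looks like in code, here is a minimal sketch that uses the OpenCL 1.2 core clCreateSubDevices entry point (the pre-1.2 cl_ext_device_fission extension works similarly). The four-compute-unit partition size is an arbitrary example value, and error checking is omitted for brevity.

```c
// Minimal Device Fission sketch: partition a CPU device into sub-devices of
// four compute units each and give each sub-device its own command queue.
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id cpu;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

    // Partition scheme: equal sub-devices of four compute units each.
    cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };

    cl_uint nSub = 0;
    clCreateSubDevices(cpu, props, 0, NULL, &nSub);    // query how many sub-devices result
    if (nSub > 16) nSub = 16;                          // keep the example's fixed array happy
    cl_device_id sub[16];
    clCreateSubDevices(cpu, props, nSub, sub, NULL);   // actually create them

    // One context spanning the sub-devices, plus a separate asynchronous
    // command queue per sub-device.
    cl_context ctx = clCreateContext(NULL, nSub, sub, NULL, NULL, NULL);
    for (cl_uint i = 0; i < nSub; i++) {
        cl_command_queue q = clCreateCommandQueue(ctx, sub[i], 0, NULL);
        // ... enqueue work that should stay on this subset of cores ...
        clReleaseCommandQueue(q);
    }

    printf("Created %u sub-devices\n", nSub);
    clReleaseContext(ctx);
    return 0;
}
```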
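The plugin approach in Part 7 builds on ordinary dynamic loading. The sketch below is purely illustrative: the library name libmyOpenCLplugin.so and the exported plugin_run() entry point are hypothetical stand-ins, not the article's actual interface.

```c
// Minimal dynamic-loading sketch: load a shared library at runtime and call
// an exported entry point that performs its OpenCL work internally.
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("./libmyOpenCLplugin.so", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    // Look up the (hypothetical) plugin entry point.
    int (*plugin_run)(int, char **) =
        (int (*)(int, char **)) dlsym(handle, "plugin_run");
    if (!plugin_run) { fprintf(stderr, "dlsym: %s\n", dlerror()); dlclose(handle); return 1; }

    int rc = plugin_run(0, NULL);   // the plugin sets up its own OpenCL context and kernels
    dlclose(handle);
    return rc;
}
```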
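The buffer mapping described for Part 6 can be sketched roughly as follows. It assumes an OpenCL context that was created with the platform-specific OpenGL-sharing properties (CL_GL_CONTEXT_KHR and related) and an existing OpenGL vertex buffer object; the function and parameter names here are hypothetical.

```c
// Minimal OpenCL/OpenGL interop sketch: wrap an existing OpenGL VBO so an
// OpenCL kernel can rewrite the vertex data without a PCIe round trip.
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

void updateSurface(cl_context ctx, cl_command_queue q,
                   cl_kernel kernel, GLuint vbo, size_t nVertices)
{
    cl_int err;
    cl_mem clBuf = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, vbo, &err);

    glFinish();                                         // OpenGL must be done with the buffer
    clEnqueueAcquireGLObjects(q, 1, &clBuf, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &clBuf);  // kernel fills in the vertices
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &nVertices, NULL, 0, NULL, NULL);

    clEnqueueReleaseGLObjects(q, 1, &clBuf, 0, NULL, NULL);
    clFinish(q);                                        // OpenCL must finish before OpenGL draws
    clReleaseMemObject(clBuf);
}
```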
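The multi-queue coordination in Parts 4 and 5 rests on giving each device its own command queue. Here is a minimal sketch, assuming a platform that exposes both a GPU and a CPU device; error checking and the actual task distribution are left out.

```c
// Minimal multi-queue sketch: one context spanning a GPU and a CPU device,
// with an independent command queue per device so work can run concurrently.
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id dev[2];
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev[0], NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev[1], NULL);

    cl_context ctx = clCreateContext(NULL, 2, dev, NULL, NULL, NULL);
    cl_command_queue gpuQueue = clCreateCommandQueue(ctx, dev[0], 0, NULL);
    cl_command_queue cpuQueue = clCreateCommandQueue(ctx, dev[1], 0, NULL);

    // ... split the user-specified number of tasks between cpuQueue and gpuQueue ...

    clFinish(gpuQueue);                 // wait for both devices to drain their queues
    clFinish(cpuQueue);

    clReleaseCommandQueue(gpuQueue);
    clReleaseCommandQueue(cpuQueue);
    clReleaseContext(ctx);
    return 0;
}
```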
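Finally, the execution-model and memory-space ideas from Parts 1 through 3 are easiest to see in a kernel. The kernel below is a hypothetical illustration rather than code from the articles; it executes one work item per point in a 1D problem domain, touches each address space, and synchronizes a work group with a barrier.

```c
// Hypothetical OpenCL C kernel illustrating the memory spaces and execution model.
__constant float scale = 2.0f;                        // constant memory: read-only for all work items

__kernel void scaleThroughLocal(__global const float *in,   // global memory: device-wide buffers
                                __global float *out,
                                __local float *tile)         // local memory: shared within a work group
{
    int gid = get_global_id(0);                       // one work item per point in the problem domain
    int lid = get_local_id(0);

    float x = in[gid];                                // private memory: per-work-item variables
    tile[lid] = x;
    barrier(CLK_LOCAL_MEM_FENCE);                     // synchronize the work group (Part 3)

    out[gid] = scale * tile[lid];
}
```

The host side sizes the __local argument with clSetKernelArg(kernel, 2, localBytes, NULL), where localBytes corresponds to the work-group size.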
Here are two examples showing the performance difference when OpenCL renders a surface using Primitive Restart on an AMD GPU versus on the CPU. (You can play both simultaneously to really compare the speed difference.)
OpenCL rendering on a CPU using Primitive Restart. Note the 100% utilization of all six CPU cores.
OpenCL rendering on a GPU. Note the dramatic increase in speed because there is no PCIe bus limitation!