The CUDA Thrust API now supports streams and concurrent kernels through the use of a new API called Bulk created by Jared Hoberock at NVIDIA. The design of Bulk is intended to extend the parallel execution policies described in the evolving Technical Specification for Parallel Extensions for C++ N3960. Note that bulk is not part of the CUDA 6.0 distribution and must be downloaded from https://github.com/jaredhoberock/bulk.
Bulk leverages Hyper-Q and CUDA streams to run concurrent tasks on the GPU. It lets the programmer describe a parallel task (e.g. sort, for_each, reduction, etcetera) as a hierarchical grouping of execution agents.
The big news is that concurrent kernel execution occurs with bulk without having to:
- Specify a launch configuration
- Decompose the problem into sub-tasks
- Marshal parameters
CUDA programmers will appreciate that bulk::par() can take an existing CUDA stream as a parameter to indicate that any CUDA kernels it launches be enqueued on that stream. The async_reduce.cu utilizes this capability. ( Note the use of par(s1;1) in the async call below.)
template<typename Iterator, typename Pointer>
__device__ void operator()(volatile bool *wait_for_me, Iterator first, Iterator last, Pointer result)
*result = thrust::reduce(thrust::device, first, last);
__device__ void operator()(bool *set_me)
*set_me = true;
thrust::device_vector<int> vec(1 << 20);
// note we launch the reduction before the greenlight
async(par(s1,1), reduce_kernel(), thrust::raw_pointer_cast(flag.data()), vec.begin(), vec.end(), result.begin());
async(par(s2,1), greenlight(), thrust::raw_pointer_cast(flag.data()));
std::cout << "result: " << thrust::reduce(vec.begin(), vec.end()) << std::endl;
std::cout << "asynchronous result: " << result << std::endl;
assert(thrust::reduce(vec.begin(), vec.end()) == result);
Individually, Bulk agents are like tiny, lightweight task-parallel threads, but when grouped together they have the potential to fully utilize the GPU. Parallel groups of agents are launched with bulk::async.
One can think of the bulk execution agents as logical lightweight threads, but in reality they may not run as separate physical threads. Think instead in terms of a grainsize parameter, which is a quantum of work where the grainsize parameter controls the size of sequential work that an individual execution agent performs. So it is orthogonal to something like warpsize or vector width. Grainsize is the “vertical” dimension.
Bulk uses a launch configuration heuristic bulk::par(n) to automagically configure the kernel launch. Occupancy is basically maximized by finding the largest block size that will maximize occupancy.
Jared explains this mapping to the hardware in the following way:
You can describe most kinds of processors via some sort of hierarchy. CPUs are decomposed into cores, each of which have some vector lanes. You could describe the hierarchy of NVIDIA GPUs as a set of cores, each of which have some warps, each of which have some vector lanes. Putting multiple GPUs into the machine might add an additional level. So there’s some actual hardware that corresponds to the physical machine, and sometimes it’s helpful to describe it as a hierarchy.
The physical hierarchy varies from machine to machine. For example, GPU A might have P cores, and GPU B might have Q cores. As a programmer, you might want to ignore this minor difference between GPU A & GPU B because it it can be burdensome. You might instead wish to think about *all* GPUs as having N cores. Maybe you want N itself to vary depending on the task the program is executing that instant. This is what is meat by a virtual hierarchy – it’s just a decomposition of the processor into logical components in the same way that CUDA thread blocks can be thought of as logical multiprocessors.
There are still other cases where the programmer doesn’t want to think about configuring any of this at all, and that’s where Bulk’s automatic launch configuration comes in.
Bulk also provides a performance enhancement, shared dynamic memory allocation for groups are performed as a single bulk allocation via a call to bulk::malloc().
In my own tests of cudaMalloc, as noted in my GTC 2014 presentation “Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more” [video][pdf] plus Dr. Dobbs tutorial “A Massively Parallel Stack“, allocating a single block of a million objects takes 0.00001766 seconds compared to individually allocating a million objects which takes cudaMalloc 7.86 seconds. As you can see, utilizing bulk::malloc() can have an appreciable performance benefit for your concurrent Thrust applications.
For more information about Bulk, check out Jared’s GTC 2014 presentation, “Building Parallel Algorithms with Bulk“.