Intel engineer Robert Ioffe has posted an OpenCL QuickSort tutorial that uses nested parallelism and work-group scan functions. In particular, the tutorial shows how to use the OpenCL™ 2.0 enqueue_kernel functions, which enqueue kernels from the device without host intervention (much like CUDA dynamic parallelism), together with work_group_scan_inclusive_add, one of a new set of work-group functions added in OpenCL 2.0 to facilitate scan and reduce operations across the work-items of a work-group.
Full source code and discussion can be found on The Code Project.
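To make the scan functions mentioned above more concrete, here is a minimal sequential sketch in plain C of what an inclusive add-scan computes and why it is useful for a QuickSort partition step. It is not taken from the tutorial. The names inclusive_scan_add and scatter_less_than are hypothetical, and the loops stand in for work that the work-items of a work-group would do in parallel, with each work-item calling work_group_scan_inclusive_add on its own flag.

```c
#include <stddef.h>

/* Sequential model of an inclusive add-scan: out[i] = in[0] + ... + in[i].
 * Assumption: on an OpenCL 2.0 device, work-item i would instead pass in[i]
 * to work_group_scan_inclusive_add and receive out[i] as the return value. */
void inclusive_scan_add(const unsigned *in, unsigned *out, size_t n)
{
    unsigned running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}

/* Hypothetical helper: copy every element smaller than pivot from src to
 * dst, preserving order, and return how many were copied.
 * flags and offsets are caller-provided scratch arrays of length n. */
size_t scatter_less_than(const int *src, int *dst,
                         unsigned *flags, unsigned *offsets,
                         size_t n, int pivot)
{
    /* Step 1: each element votes 1 if it belongs to the "less than" side. */
    for (size_t i = 0; i < n; ++i)
        flags[i] = (src[i] < pivot) ? 1u : 0u;

    /* Step 2: the inclusive scan turns the votes into 1-based output slots. */
    inclusive_scan_add(flags, offsets, n);

    /* Step 3: every voting element writes itself to its own slot, so no two
     * elements ever write to the same location. */
    for (size_t i = 0; i < n; ++i)
        if (flags[i])
            dst[offsets[i] - 1] = src[i];

    return n ? offsets[n - 1] : 0;
}
```

For the input {5, 2, 8, 1, 9, 3} and pivot 5, the flags are {0, 1, 0, 1, 0, 1}, the scan yields {0, 1, 1, 2, 2, 3}, and the elements 2, 1, 3 land in slots 0, 1, 2. Because each slot is computed independently of the others, the three steps map naturally onto one work-item per element.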
A CUDA version of bitonic sort that strong-scales across GPUs can be found in the TechEnablement article, “Part 2: No Idle Time CUDA Task Parallelism Across Eight GPUs”.
Note the faster performance of bitonic sort on small problems (4.5 ms vs. 246.9 ms), achieved by eliminating recursive calls.
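As an illustration of what a bitonic sort without recursive calls looks like, here is a minimal sequential sketch in plain C. It is not the CUDA code from the TechEnablement article and says nothing about the timings quoted above. The function name bitonic_sort is hypothetical, and the sketch assumes the array length is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Iterative bitonic sort, ascending order.
 * Assumption: n is a power of two (checked by the assert below). */
void bitonic_sort(int *a, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);

    /* k is the size of the bitonic sequences being merged in this stage. */
    for (size_t k = 2; k <= n; k <<= 1) {
        /* j is the distance between the two elements of each comparison. */
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            /* Every iteration of this loop is independent of the others,
             * so on a GPU it would be one thread per index i. */
            for (size_t i = 0; i < n; ++i) {
                size_t partner = i ^ j;
                if (partner > i) {
                    /* Blocks of size k alternate between ascending and
                     * descending so that neighbors form bitonic sequences. */
                    int ascending = (i & k) == 0;
                    int out_of_order = ascending ? (a[i] > a[partner])
                                                 : (a[i] < a[partner]);
                    if (out_of_order) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
}
```

The two outer loops replace the recursion of the textbook formulation with a fixed schedule of compare-and-swap passes, so the only per-stage cost is the pass itself rather than a chain of nested calls or kernel launches.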