TechEnablement spoke with NVIDIA’s John Ashley about the role of CUDA-accelerated Java in the enterprise and the approaches people are taking to accelerate Java with GPUs. Mixing CUDA and Java is an atypical combination because Java is designed to abstract software away from the hardware so it can run anywhere, while CUDA is specifically designed to provide hardware acceleration on GPUs. According to John, CUDA-accelerated Java is “a powerful statement of intent” and “a natural evolution as big data gets bigger”.

(Image courtesy OpenPower http://openpowerfoundation.org/wp-content/uploads/2015/03/Li-Kelvin_OPFS2015_IBM_031315_final.pdf)
NVIDIA has already taken the HPC world by storm, as exemplified by the astounding growth and success of CUDA in the scientific, technical, and educational communities. The $325M OpenPower-based CORAL procurement illustrates the confidence the US Government has in the combination of IBM POWER9 and NVIDIA Volta for HPC. It is only natural that this investment be leveraged to benefit OpenPower and enterprise customers as well.
John characterized customers interested in CUDA-accelerated Java as those “with performance limited Java applications”. This makes sense, as CUDA acceleration benefits the customer while preserving both Java portability and the existing Java software investment.
Claims of 8x performance increases are already being made:

(Image courtesy OpenPower http://openpowerfoundation.org/wp-content/uploads/2015/03/Li-Kelvin_OPFS2015_IBM_031315_final.pdf)
There appear to be three current approaches to CUDA-accelerated Java:
- Using a wrapper so Java can call CUDA code or CUDA-accelerated libraries such as cuFFT, cuBLAS, and others
- Objects that know how to parallelize themselves
- JVM introspection that tracks and identifies computationally intensive regions of code, which are then parallelized automatically
1. Using a wrapper
The wrapper approach is well known in the community for calling methods in other languages from Java. However, the programmer must define an API, much like a library interface, which hides much of the flexibility of the CUDA language. Common interface generators include jcuda, SWIG, and IBM’s CUDA4J, to name a few. John mentioned that jcuda appears to be gaining traction with developers. We also observe that the wrapped methods can be written in device-portable approaches such as OpenACC and OpenCL as well as CUDA. A minimal sketch of the wrapper style appears below.
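As an illustration of calling a CUDA-accelerated library through a wrapper, the following sketch performs a SAXPY (y = alpha*x + y) on the GPU through the JCuda/JCublas bindings. It assumes the JCuda native libraries are installed and on the library path; the calls follow the JCublas API that mirrors the legacy cuBLAS interface, and error checking is omitted for brevity.

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;

// Minimal sketch: SAXPY on the GPU via the JCublas wrapper around cuBLAS.
public class JCublasSaxpyExample {
    public static void main(String[] args) {
        int n = 1 << 20;
        float alpha = 2.0f;
        float[] x = new float[n];
        float[] y = new float[n];
        for (int i = 0; i < n; i++) { x[i] = i; y[i] = 1.0f; }

        JCublas.cublasInit();                        // initialize cuBLAS

        // Allocate device memory and copy the host vectors to the GPU
        Pointer dX = new Pointer();
        Pointer dY = new Pointer();
        JCublas.cublasAlloc(n, Sizeof.FLOAT, dX);
        JCublas.cublasAlloc(n, Sizeof.FLOAT, dY);
        JCublas.cublasSetVector(n, Sizeof.FLOAT, Pointer.to(x), 1, dX, 1);
        JCublas.cublasSetVector(n, Sizeof.FLOAT, Pointer.to(y), 1, dY, 1);

        // y = alpha * x + y, executed on the GPU by cuBLAS
        JCublas.cublasSaxpy(n, alpha, dX, 1, dY, 1);

        // Copy the result back and release device resources
        JCublas.cublasGetVector(n, Sizeof.FLOAT, dY, 1, Pointer.to(y), 1);
        JCublas.cublasFree(dX);
        JCublas.cublasFree(dY);
        JCublas.cublasShutdown();

        System.out.println("y[1] = " + y[1]);        // expected: 2*1 + 1 = 3
    }
}
```

Note how the GPU work stays hidden behind library calls: the Java code never expresses a kernel, which is exactly the flexibility the wrapper approach gives up in exchange for simplicity.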
2. Objects that know how to parallelize themselves
Exploiting the benefits of Java classes, approaches such as PCJ (Parallel Computing in Java) help to simplify the expression of parallelism. This approach is gaining recognition: PCJ recently won the HPC Challenge Class 2 Best Productivity Award for the efficient way it enables the programming of parallel applications. PCJ can be downloaded from GitHub (https://github.com/hpdcj/pcj). John mentioned that this approach is “interesting for both loop and task parallelism”.
To use PCJ, Java programmers write a class that extends the Storage class and implements the StartPoint interface. The Storage class is used to define shared variables, and the StartPoint interface provides the functionality needed to start threads, enumerate them, and perform the initial synchronization of tasks. The PCJ.deploy() method initializes the application using a list of nodes provided as its third argument; the list contains the Internet addresses of the computers (cluster nodes) to be used.
The PCJ documentation provides a simple parallel example that approximates pi using a Monte Carlo method and scales across threads and nodes. A sketch of that style of program follows.
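The sketch below is modeled on the structure described above: a class that extends Storage, implements StartPoint, and is launched with PCJ.deploy(). It is an illustration only; the shared-variable annotation, the putLocal/get calls, and the exact deploy() signature are assumptions based on older PCJ releases and may differ in current versions of the library.

```java
import org.pcj.PCJ;
import org.pcj.Shared;
import org.pcj.StartPoint;
import org.pcj.Storage;

// Sketch of a PCJ-style Monte Carlo pi estimator (API details are assumptions).
public class PcjPiExample extends Storage implements StartPoint {

    @Shared
    long circleCount;                         // per-thread count of hits inside the unit circle

    @Override
    public void main() throws Throwable {
        long samplesPerThread = 1_000_000;
        java.util.Random random = new java.util.Random(PCJ.myId());

        long hits = 0;
        for (long i = 0; i < samplesPerThread; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                hits++;
            }
        }
        PCJ.putLocal("circleCount", hits);    // publish this thread's count
        PCJ.barrier();                        // wait until all threads have finished sampling

        // Thread 0 gathers the per-thread counts and prints the estimate
        if (PCJ.myId() == 0) {
            long total = 0;
            for (int p = 0; p < PCJ.threadCount(); p++) {
                total += (Long) PCJ.get(p, "circleCount");
            }
            double pi = 4.0 * total / ((double) samplesPerThread * PCJ.threadCount());
            System.out.println("pi ~= " + pi);
        }
    }

    public static void main(String[] args) {
        // Third argument: list of node addresses (here, two JVMs on localhost)
        PCJ.deploy(PcjPiExample.class, PcjPiExample.class,
                   new String[] { "localhost", "localhost" });
    }
}
```

Each thread counts its own hits inside the unit circle, and thread 0 then collects the per-thread counts to form the estimate, so the only communication is a single reduction at the end.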
3. JVM introspection
JVM introspection is very interesting because it requires no action on the part of the programmer. Instead, the JVM keeps track of what it has run and how long it takes. If a loop or region of code consumes significant time, it becomes a candidate for translation to the GPU. First efforts at IBM are focused on parallelizing lambda functions. In the future, pre-compiled versions of the parallelized code could eliminate the overhead of the first (and potentially very slow) entry into parallelizable sections of code. The plain Java code below shows the kind of lambda such a JIT could target.
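The following is ordinary Java 8 code with no GPU-specific constructs; it simply illustrates the shape of code a GPU-aware JIT, such as IBM's lambda-offload work mentioned above, could identify and translate automatically: a data-parallel, side-effect-free lambda applied over an index range. The class name and problem size are arbitrary.

```java
import java.util.stream.IntStream;

// Plain Java 8: nothing GPU-specific appears in the source. A JVM with a
// GPU-aware JIT could notice this hot, data-parallel loop and compile the
// lambda body into a GPU kernel without programmer intervention.
public class IntrospectionCandidate {
    public static void main(String[] args) {
        int n = 1 << 24;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Each index is independent, so the lambda maps naturally onto GPU threads.
        IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] + b[i]);

        System.out.println("c[10] = " + c[10]);
    }
}
```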
Conclusion
John acknowledges that people are still learning how to accelerate Java with CUDA. In particular, data movement is an issue, although TechEnablement notes that the claimed advantages of NVLink may make this a minor or even a non-issue. Similarly, it is not clear at this time whether managed memory will help address either data movement or the details of interfacing with the Java garbage collector.
Still, when performance matters, GPU acceleration of Java can be a viable solution. At the moment, wrappers and the PCJ project are practical approaches, while JVM introspection offers an exciting glimpse into the future.