
OpenPower and CUDA-Accelerated Java as a Path Into the Enterprise

April 14, 2015 by Rob Farber

TechEnablement spoke with NVIDIA’s John Ashley about the role of CUDA-accelerated Java in the enterprise and the approaches people are taking to accelerate Java with GPUs. Mixing CUDA and Java is an atypical combination because Java’s intent is to abstract the software away from the hardware so it can run anywhere. In contrast, CUDA is specifically designed to provide hardware acceleration on GPUs. According to John, CUDA-accelerated Java is “a powerful statement of intent” that is “a natural evolution as big data gets bigger”.

(Image courtesy OpenPower http://openpowerfoundation.org/wp-content/uploads/2015/03/Li-Kelvin_OPFS2015_IBM_031315_final.pdf)

NVIDIA has already taken the HPC world by storm, as exemplified by the astounding growth and success of CUDA in the scientific, technical, and educational communities. The $325M OpenPower-based CORAL procurement illustrates the confidence the US Government has in the combination of IBM POWER9 and NVIDIA Volta for HPC. It is only natural that this investment be leveraged to benefit OpenPower and enterprise customers as well.

John characterized customers interested in CUDA-accelerated Java as those “with performance limited Java applications”. This makes sense, as CUDA acceleration benefits the customer while preserving both Java portability and the existing Java software investment.

Claims of 8x performance increases are already being made:

(Image courtesy OpenPower http://openpowerfoundation.org/wp-content/uploads/2015/03/Li-Kelvin_OPFS2015_IBM_031315_final.pdf)

There appear to be three current approaches to CUDA-accelerated Java:

  1. Using a wrapper so Java can call CUDA kernels, or CUDA-accelerated libraries such as cuFFT and cuBLAS
  2. Objects that know how to parallelize themselves
  3. Using JVM introspection to identify computationally intensive regions of code, which are then parallelized automatically

1. Using a wrapper

The wrapper approach is well known in the community for calling methods written in other languages from Java. However, the programmer must define an API, much like a library interface, which hides much of the flexibility of the CUDA language. Common interface generators include jcuda, SWIG, and IBM’s cuda4j. John mentioned that jcuda appears to be gaining traction with developers. We also note that the wrapped methods can be written in device-portable languages such as OpenACC and OpenCL as well as CUDA.
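As a minimal illustration of the wrapper pattern (a sketch, not the output of any particular generator; the library name "saxpy_cuda" and the method signature are hypothetical), the Java side simply declares native methods and loads a shared library whose CUDA C/C++ implementation allocates device memory, copies data, and launches the kernel or calls cuBLAS:

  // Sketch of the wrapper approach: Java declares native methods; the
  // hypothetical native library "saxpy_cuda", written in CUDA C/C++,
  // performs the device allocations, copies, and kernel launch.
  public class GpuSaxpy {
      static {
          System.loadLibrary("saxpy_cuda"); // hypothetical native library
      }

      // y = a*x + y, executed on the GPU by the native implementation
      public static native void saxpy(float a, float[] x, float[] y);

      public static void main(String[] args) {
          float[] x = new float[1 << 20];
          float[] y = new float[1 << 20];
          java.util.Arrays.fill(x, 1.0f);
          java.util.Arrays.fill(y, 2.0f);
          saxpy(2.0f, x, y); // crosses the JNI boundary into CUDA code
          System.out.println("y[0] = " + y[0]); // expect 4.0
      }
  }

Interface generators such as jcuda and cuda4j automate exactly this kind of glue so the programmer does not have to write the JNI boilerplate by hand.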

2. Objects that know how to parallelize themselves

Exploiting the benefits of Java classes, approaches such as PCJ (Parallel Computing in Java) help to simplify the expression of parallelism. This approach appears to be gaining recognition: PCJ recently won the HPC Challenge Class 2 Best Productivity Award for the efficient way it enables the programming of parallel applications. PCJ can be downloaded from GitHub (https://github.com/hpdcj/pcj). John mentioned that this approach is “interesting for both loop and task parallelism”.

To use PCJ, Java programmers write a single class that extends the Storage class and implements the StartPoint interface. The Storage class is used to define shared variables, and the StartPoint interface provides the functionality needed to start threads, enumerate them, and perform the initial synchronization of tasks. The PCJ.deploy() method initializes the application using a list of nodes provided as its third argument; this list contains the Internet addresses of the computers (cluster nodes) to be used.
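A minimal skeleton of that structure is sketched below. It is based on the description above and on the PCJ documentation of this era, so the package names and the exact PCJ.deploy() signature should be treated as assumptions.

  import org.pcj.PCJ;
  import org.pcj.StartPoint;
  import org.pcj.Storage;

  // Skeleton of a PCJ task: one class extends Storage (shared variables)
  // and implements StartPoint (the parallel entry point).
  public class PcjHello extends Storage implements StartPoint {

      @Override
      public void main() {
          // Every PCJ thread runs this method.
          System.out.println("Hello from thread " + PCJ.myId()
                  + " of " + PCJ.threadCount());
          PCJ.barrier(); // synchronize all threads
      }

      public static void main(String[] args) {
          // Third argument: the list of node addresses to run on.
          PCJ.deploy(PcjHello.class, PcjHello.class,
                  new String[]{"localhost", "localhost"});
      }
  }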

The PCJ documentation includes a simple parallel example that approximates pi using a Monte Carlo method; it exhibits the following scaling behavior:

(Image courtesy the PCJ project)

3. JVM introspection

JVM introspection is very interesting because it requires no action on the part of the programmer. Instead, the JVM keeps track of what it has run and how long it takes. If a loop or region of code consumes significant time, it becomes a candidate for translation to the GPU. First efforts at IBM are focused on parallelizing lambda functions. In the future, pre-compiled versions of the parallelized code could eliminate the overhead of the first (and potentially very slow) entry into parallelizable sections of code.
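For concreteness, the kind of code such a JIT targets is an ordinary Java 8 parallel-stream lambda like the sketch below. This is standard Java shown only to illustrate what “parallelizing lambda functions” means; whether a given JVM actually offloads it to a GPU depends on the implementation.

  import java.util.stream.IntStream;

  public class LambdaSaxpy {
      public static void main(String[] args) {
          int n = 1 << 20;
          float a = 2.0f;
          float[] x = new float[n];
          float[] y = new float[n];
          java.util.Arrays.fill(x, 1.0f);
          java.util.Arrays.fill(y, 2.0f);

          // A data-parallel lambda over an index range: every iteration is
          // independent, which is exactly the pattern a GPU-aware JIT looks for.
          IntStream.range(0, n).parallel()
                   .forEach(i -> y[i] = a * x[i] + y[i]);

          System.out.println("y[0] = " + y[0]); // expect 4.0
      }
  }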

Conclusion

John acknowledges that people are still learning how to accelerate Java with CUDA. In particular, data movement is an issue, although TechEnablement notes that the claimed advantages of NVLink may make it a minor or even a non-issue. Similarly, it is not clear at this time whether managed memory will help address either data movement or the details of interfacing with the Java garbage collector.

Still, when performance matters, GPU acceleration of Java can be a viable solution. At the moment, wrappers and the PCJ project are the practical approaches, while JVM introspection offers an exciting glimpse into the future.

 

 
