John Stone (Research Staff, The Beckman Institute) points out that improvements in the AVX-512 instruction set in the Intel Xeon Phi (and latest-generation Intel Xeon processors) can deliver significant speedups for some time-consuming molecular visualization kernels compared to most existing Intel Xeon CPUs. Based on his recent results using the Intel Xeon Phi hardware exponential instruction, Stone notes, “At present, I can say that the Intel Xeon Phi processor is the highest performance CPU result I’ve benchmarked for this molecular orbital algorithm to date.” We discuss Stone’s results in greater detail later in this article.
Stone’s results reflect a change in the visualization community, where CPU-based visualization is now both accepted and viewed as fixing a community-wide problem. The 2016 University of Utah presentation Towards Direct Visualization on CPU and Xeon Phi highlights this change in mindset by noting that “if computing is the third pillar of science then visualization is the fourth pillar” yet “visualization currently can barely handle mid-gigascale data”. The same presentation also notes that visualization “is two orders of magnitude and ten years behind simulation”.
The reason is that the traditional approach, OpenGL GPU-based rasterization, was designed for millions of polygons, while scientific visualization needs to support billions to trillions of elements. The solution is large-scale CPU-based ray tracing, using packages such as OSPRay running on big-memory CPU nodes, combined with in-situ visualization. Memory capacity is very important, hence the need for big-memory compute nodes. Thus the mindset of “just use the same GPU graphics we use for games” is disappearing, replaced by CPU-based Software Defined Visualization (SDVis).
Demonstrations at both Supercomputing 2017 and the Intel Developers Conference in Denver, Colorado make this concrete by showing that even a device that simply displays an image from a framebuffer (i.e., one where the image is rendered in memory and the device provides no hardware acceleration) can be used to interactively visualize even the most complex photorealistic images. For OpenGL users, David DeMarle (visualization luminary and engineer at Kitware) observes that, “CPU-based OpenGL performance does not trail off even when rendering meshes containing one trillion (10 ** 12) triangles on the Trinity leadership class supercomputer. Further, we might see a 10-20 trillion triangle per second result as our current benchmark used only 1/19th of the machine”. The ability of the CPU to access large amounts of memory is key to realizing trillion-triangle-per-second rendering capability.
Where visualization broke and how Utah used CPUs to fix it
The distinction between indirect and direct rendering is key to understanding the problem. Indirect rendering is based on triangles, has a large memory overhead, and introduces a complex, high-overhead (“heavy”) preprocessing pipeline. In contrast, direct rendering is based on volumes (a natural construct for scientific computing), has a low memory overhead, and uses a flat preprocessing workflow that introduces little or no overhead.
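To make the idea of direct rendering concrete, here is a minimal sketch of ray marching through a volume, with no triangle extraction step in between. It is an illustration only: the `density` field, the absorption constant, and the grey-scale shading are hypothetical choices made for this example and are not taken from OSPRay or from the Utah codes.

```python
import math

def density(x, y, z):
    """Hypothetical scalar field: a soft ball of radius 0.5 centred in the unit cube."""
    r = math.sqrt((x - 0.5) ** 2 + (y - 0.5) ** 2 + (z - 0.5) ** 2)
    return max(0.0, 1.0 - r / 0.5)

def march_ray(origin, direction, step=0.01, n_steps=100, absorption=4.0):
    """Sample the volume along one ray and composite the samples front to back."""
    color = 0.0   # accumulated grey-scale intensity
    alpha = 0.0   # accumulated opacity
    for i in range(n_steps):
        t = (i + 0.5) * step                          # midpoint of this ray segment
        x, y, z = (origin[k] + t * direction[k] for k in range(3))
        d = density(x, y, z)
        a = 1.0 - math.exp(-absorption * d * step)    # opacity of this sample
        color += (1.0 - alpha) * a * d                # sample colour is its density
        alpha += (1.0 - alpha) * a
        if alpha > 0.99:                              # early ray termination
            break
    return color, alpha

# One ray through the middle of the ball, one that misses it entirely.
through_center = march_ray((0.5, 0.5, 0.0), (0.0, 0.0, 1.0))
through_corner = march_ray((0.02, 0.02, 0.0), (0.0, 0.0, 1.0))
```

The only data the renderer touches is the volume itself, which is why the preprocessing workflow can stay flat: there is no intermediate mesh to build, store, or transfer.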
Starting in 2010, the University of Utah team, then headed by Aaron Knoll, re-envisioned scientific visualization by moving to direct rendering techniques. The early approach used grid-based volumes plus glyphs and ray casting on the GPU. Direct visualization allowed the team to immediately see and analyze materials data with almost no preprocessing, which made the transition a win. However, slow data transport across the PCIe bus and the lack of memory on the GPU limited success and broke production codes such as Nanovol. The work of Ken-ichi Nomura (then at USC) highlights very slow visualization performance on a 1 GB/time step 15M ANP3 aluminum oxidation data set. This caused the Utah visualization team to examine other approaches, including the then highly controversial use of CPUs for visualization, to get around the GPU-related performance problems.
Just like today, the CPU-based approach that Utah proposed drew criticism along the lines of, “Vis is graphics and GPUs are designed especially for graphics … right?” The counterargument is that one can perform in-situ visualization on the same CPUs that run the simulation, thus eliminating data movement. This forward-thinking approach is now considered by many to be a requirement for exascale computing.
From the start, the Utah team focused on defining the right goals and algorithms (i.e., research) to create usable software (to run in production). Thus they took a very pragmatic approach to finding and creating production-quality solutions.
Before 2013, comprehensive CPU rendering solutions did not yet exist, and the CPU-based packages that did exist at that time had major shortcomings. However, there was strong evidence that CPU-based visualization was both possible and desirable. Experiments performed between 2013 and 2015 showed that a first-generation Intel Xeon Phi coprocessor was able to deliver competitive, and sometimes even better, performance than GPUs on scientific visualization tasks.
By 2015, the Utah team felt that the OSPRay CPU-based ray tracing package had matured and was production ready. OSPRay specifies an API for visualization that is similar to OpenGL (yet simpler) with the addition of ray tracing and visualization semantics. Knoll points out that OSPRay is, “Often almost as fast (or faster) than GPU approaches – and (almost) never runs out of memory!” (For more information see the paper by Wald et al., OSPRay: A CPU Ray Tracing Framework for Scientific Visualization, IEEE Vis 2016.)
Integrated CPU-based visualization and performance
OSPRay has been integrated into popular “indirect” production visualization packages such as ParaView, VisIt, VL3, and VMD. The software is free to use and is licensed according to an open-source BSD clause 2 license. Recent OSPRay software optimizations have significantly increased visualization performance, which has made CPU-based rendering even more competitive compared to a modern GPU solution.
A CPU/GPU bakeoff using VL3 at Utah
Recently, the Utah team performed a vl3-based bakeoff of the latest-generation Intel Xeon Phi processor against a Pascal-based NVIDIA GTX 1080 GPU. vl3 is a special-purpose, large-scale volume rendering API from Argonne National Laboratory. The bakeoff compared vl3-ospray on Intel Xeon Phi against vl3-GLSL on the GPU.
The Intel Xeon Phi processor exceeds GPU performance by a significant margin for larger problems, as shown in the performance graph below. Note that the y-axis is a log scale. These performance results show that the Intel Xeon Phi processor does surprisingly well, with a sweet spot around 1k^3 – 2k^3 volume data. In particular, the Intel Xeon Phi MCDRAM helps with this performance sweet spot.
Optimizing Visual Applications for Intel Xeon Phi Processors
Optimizing ray-tracing packages is a big effort that is challenging even for experts.
John Stone (Research Staff, The Beckman Institute) reinforced that ray tracing has many benefits for molecular visualization, particularly with respect to high fidelity lighting and shading of molecular scenes with inherently complex geometry, and he noted that CPU-based ray tracing can achieve high performance, allowing fully interactive display of very large molecular complexes. The latest version of VMD incorporates OSPRay to allow high performance ray tracing on Intel hardware platforms, which helps when running on the latest Xeon Phi processor-based supercomputers at TACC and Argonne National Laboratory.
More specifically, Stone observed in his talk at the November 2016 Intel HPC Developer Conference, “Visualization and Analysis of Biomolecular Complexes on Upcoming KNL-based HPC Systems”, that the potential of wide SIMD hardware with fast exponential instructions was known as early as 2009. In this talk, Stone also highlighted results showing that the work at the Beckman Institute on VMD is still ongoing and that tremendous performance results can still be achieved:
Stone expects that further improvement will require more attention to the details of cache behavior and further tuning of low-level threading constructs for Intel Xeon Phi processors. This is important for Stone’s team and collaborators as they have many Intel AVX-512 wide-vector kernels that need to run fast on systems approaching exascale.
In more recent work (March 2017), Stone compared the performance of an Intel Xeon Phi processor 7250 running at 1.4GHz against a dual-socket Intel Xeon processor E5-2687Wv3 system running at 3.1GHz. Stone reports, “Very solidly impressive results for Intel Xeon Phi processors when using the exponential instructions vs. a dual-socket Intel Xeon processor that is using an older SSE+AVX1 kernel”. The Intel Xeon processor had to run the older, hand-coded kernel because it lacks the hardware exponential instruction. Stone finds that a top Intel Xeon Phi chip is about 4x – 6x faster than most existing Intel Xeon CPUs running the SSE+AVX code path, as shown in the table below:
Stone followed up on these results by saying, “There may be ways to improve performance on other kinds of CPUs, but the current Intel Xeon Phi hardware has a big advantage for this code due to its high aggregate FLOP rates (due to very wide vectors and MCDRAM) and also specifically due to the availability of fast exponential approximation instructions, e.g., as available via _mm512_exp2a23_ps() that are part of the AVX-512ER instruction subset that is currently and in the near future, a unique feature of the Intel Xeon Phi hardware relative to the rest of Intel’s CPU offerings. This makes Intel Xeon Phi processors well suited for scientific computing algorithms that use such special functions like this one.”
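The instruction Stone names, `_mm512_exp2a23_ps()`, approximates a base-2 exponential, so a kernel written in terms of e^x has to be rephrased on base 2 to use it. The sketch below shows only that rewrite, in plain scalar Python. It assumes the inner loop is dominated by Gaussian-type terms of the form c * exp(-zeta * r^2); the function names and sample values are hypothetical, and the code is not VMD's kernel.

```python
import math

# e^x == 2^(x * log2(e)), so one extra multiply turns exp into exp2.
LOG2_E = math.log2(math.e)

def gaussian_term_exp(coeff, zeta, r2):
    """Reference form: coeff * e^(-zeta * r^2)."""
    return coeff * math.exp(-zeta * r2)

def gaussian_term_exp2(coeff, zeta, r2):
    """Same term on base 2, the shape a hardware exp2 instruction can evaluate."""
    # zeta * LOG2_E does not depend on position, so a real kernel could
    # fold it into the exponent once, outside the per-grid-point loop.
    return coeff * 2.0 ** (-zeta * LOG2_E * r2)

# Hypothetical (coefficient, exponent, squared distance) samples.
samples = [(0.7, 0.5, 0.0), (1.3, 2.0, 1.5), (0.2, 0.1, 25.0)]
pairs = [(gaussian_term_exp(*s), gaussian_term_exp2(*s)) for s in samples]
```

On hardware with a fast exp2 approximation, the power operation in `gaussian_term_exp2` is the part that would map to the vector instruction; the accuracy of that approximation is a property of the hardware and is not modeled here.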
Just like the University of Utah, Stone also emphasized the importance of memory capacity for ray tracing large biomolecular complexes. He used the following 128 GB images to reinforce this point:
CPU-based visualization performance results
CPU-based visualization also helps with compositing and the incorporation of new algorithms such as p-k-d trees.
As seen in the results below, the University of Utah team observes consistently interactive compositing up to 4k resolution on 1k nodes. Dynamically scheduled region-based compositing is also being used to speed up large-scale unbalanced visualization workloads. The Utah team suggests that similar approaches can be used for distributed rendering with the OSPRay framework.
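For readers new to the term, compositing is the step that merges the partial images rendered by different nodes into one final image. The sketch below shows only the basic depth-ordered merge with the standard “over” operator on premultiplied pixels. It is not the TOD-Tree or dynamically scheduled region-based algorithm, it assumes each node's region can be ordered by a single depth value, and the data layout is hypothetical.

```python
def over(front, back):
    """Porter-Duff 'over' for premultiplied (color, alpha) pixels."""
    fc, fa = front
    bc, ba = back
    return (fc + (1.0 - fa) * bc, fa + (1.0 - fa) * ba)

def composite(partials):
    """Merge per-node partial images, nearest region first.

    partials: list of (depth, image), where image is a list of
    (color, alpha) pixels and every image has the same length.
    """
    ordered = sorted(partials, key=lambda p: p[0])     # smallest depth = nearest
    result = [(0.0, 0.0)] * len(ordered[0][1])         # start fully transparent
    for _, image in ordered:
        result = [over(acc, px) for acc, px in zip(result, image)]
    return result

# Two hypothetical nodes, each contributing a two-pixel partial image.
near = (1.0, [(0.5, 0.5), (1.0, 1.0)])   # half-transparent pixel, opaque pixel
far = (2.0, [(0.2, 1.0), (0.3, 1.0)])
final = composite([far, near])           # arrival order does not matter
```

In a distributed renderer the cost lies in exchanging these partial images between nodes, which is the part the compositing algorithms cited in this article are designed to schedule efficiently.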
Balanced k-d trees, or “point” k-d trees (P-k-d trees), have provided some spectacular speedups and visual capabilities, as illustrated in the following image of a 100M atom Al2-O3 Sic alumina-coated nanoparticle MD simulation rendered in OSPRay at 2-4 fps at 4k resolution.
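The appeal of a balanced point k-d tree is that the particle array can serve as the tree itself, so no separate acceleration structure needs to be stored. The following is a simplified sketch of that idea, not the data layout from the P-k-d tree paper: it places the median of each sub-range at the range midpoint, cycles the split axis, and uses a full sort where a production build would use a selection algorithm. All names are hypothetical.

```python
import random

def build_inplace(points, lo=0, hi=None, axis=0):
    """Reorder points[lo:hi] so the median along `axis` sits at the midpoint."""
    if hi is None:
        hi = len(points)
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    # Sorting is the simplest way to place the median; it is not the fastest.
    points[lo:hi] = sorted(points[lo:hi], key=lambda p: p[axis])
    nxt = (axis + 1) % 3
    build_inplace(points, lo, mid, nxt)        # left half: coordinate <= median
    build_inplace(points, mid + 1, hi, nxt)    # right half: coordinate >= median

def within_radius(points, center, radius, lo=0, hi=None, axis=0, out=None):
    """Collect all points within `radius` of `center` using the implicit tree."""
    if out is None:
        out = []
    if hi is None:
        hi = len(points)
    if hi <= lo:
        return out
    mid = (lo + hi) // 2
    p = points[mid]
    if sum((p[k] - center[k]) ** 2 for k in range(3)) <= radius * radius:
        out.append(p)
    nxt = (axis + 1) % 3
    delta = center[axis] - p[axis]
    if delta - radius <= 0:                    # query ball reaches the left half
        within_radius(points, center, radius, lo, mid, nxt, out)
    if delta + radius >= 0:                    # query ball reaches the right half
        within_radius(points, center, radius, mid + 1, hi, nxt, out)
    return out

rng = random.Random(7)
particles = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
build_inplace(particles)
hits = within_radius(particles, (0.5, 0.5, 0.5), 0.2)
```

Because the tree is encoded purely in the ordering of `particles`, the memory footprint is that of the particle data alone, which matters when the data set already fills most of a node's memory.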
The following CPU-based simulation provides a visual comparison of the power of P-k-d trees and CPU-based visualization. These results modeled approximately 30 billion particles in the early universe. This image was rendered at 6 fps using ray tracing on a single Intel Xeon E7-8890 v3 system with 3 TB of shared memory. In contrast, a 128-GPU cluster with 1 TB of distributed memory was used to render the following 28 billion particle simulation.
Figure 10: 28 billion particles: ~20 megapixels/s 
Since 2013, the capabilities of CPU-based visualization packages and hardware have grown by leaps and bounds. Third-party use cases, images, and peer-reviewed articles now demonstrate the validity of a CPU-based direct visualization approach. Open-source visualization software such as OSPRay and new algorithms such as P-k-d trees mean that the general scientific community needs to rethink visualization and its visualization hardware platforms. The fact that a single large-memory workstation can deliver competitive and even superior interactive rendering performance compared to a 128-node GPU cluster is paradigm changing. Ongoing research at the University of Utah, which includes the exploration of in-situ visualization using P-k-d trees, shows that new algorithms and in-situ visualization are hot research topics that promise to eliminate data movement and bring simulation and visualization performance back into balance.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations.
 Knoll et al., Pacific Vis, “Volume rendering an 8 GB dataset on 8-core CPU workstation faster than a 128-node GPU cluster”, 2011
 Wald et al., Siggraph, “Embree: acceleration structure builds are no longer a major bottleneck”, 2014
 http://www.intel.com/content/dam/www/public/us/en/documents/presentation/sw-vis-john-stone-vis-analy-of-biom-compl-on-knl-based-hpc-sys.pdf. Funding based on NSF OCI 07-25070 – NSF PRAC “The Computational Microscope” – NIH support: 9P41GM104601, 5R01GM098243-02.
 P. Grosset, M. Prasad, C. Christensen, A. Knoll, C.D. Hansen. “TOD-Tree: Task-Overlapped Direct send Tree Image Compositing for Hybrid MPI Parallelism”. Proceedings of Eurographics Symposium on Parallel Graphics and Visualization (EGPGV) 2015
 Pascal Grosset, Aaron Knoll, Chuck Hansen. “Dynamically Scheduled Region-based Compositing.” Eurographics Symposium on Parallel Graphics and Visualization 2016.
 Ingo Wald, Aaron Knoll, Gregory P. Johnson, Will Usher, Valerio Pascucci and Michael E. Papka. CPU Ray Tracing Large Particle Data with Balanced P-k-d Trees. IEEE Vis 2015
 I Wald, A Knoll, G Johnson, W Usher, M E Papka, V Pascucci. “CPU Ray Tracing Large Particle Data with Balanced P-k-d Trees”, IEEE Vis 2015