Global financial industry leaders such as Citi and J.P. Morgan have acknowledged that they are modernizing their code in collaboration with the Intel Software and Solutions Group. Results reported in the recent presentation, Intel’s New STAC A2 Results and Speeding Up FX LSV Monte Carlo case study, demonstrate an overall 32.9x speedup in the financial industry standard STAC-A2 benchmarks over the past two and a half years. This speedup shows how much code modernization matters, and illustrates one reason why global financial leaders are working to modernize their application source trees. In short, the transition to parallel computing on multi-core and many-core processors means that simply procuring the latest generation of hardware is not enough. Software code modernization efforts are required to fully exploit the capabilities – and performance benefits – of these new parallel computing platforms.
Importance of the reported results
A significant performance increase in the STAC-A2 benchmarks is important because they are the financial industry standard for evaluating and testing computing platforms to determine how they will perform when running real, business-sensitive (e.g. proprietary), compute-intensive analytic pricing and risk management workloads. Financial institutions on the STAC Benchmark Council maintain the STAC-A2 benchmark standards in order to yield test results that accurately reflect realistic financial workload performance. This allows firms to use vendor-derived performance data to quickly procure the best performing, most cost effective, and highest performance per watt computer technology.
The reported 32.9x speedup in STAC-A2 underscores the rapid rate of change in the industry and the importance of code modernization efforts that stay aligned with hardware roadmaps over the long term. These benchmark results clearly show that even three-year-old highly optimized, hand-tuned application code can no longer be considered competitive from a performance perspective. Code modernization is essential to fully exploit the many-core performance capabilities of today’s parallel computing platforms.
These fastest-ever performance advantages are exemplified in warm runs of the baseline option Greeks calculation benchmark when running in a heterogeneous dual Intel® Xeon Phi™ coprocessor and Intel® Xeon® processor-based system. This is a “code modernized” implementation that uses Intel® Threading Building Blocks (Intel® TBB) and OpenMP to fully exploit the vector, multi-core, and heterogeneous computing capabilities of the runtime hardware environment. The Intel system that ran this benchmark code to achieve the fastest benchmark time utilized two Intel® Xeon Phi™ co-processors 7120P plus a dual Intel® Xeon® processor E5-2697 v3 chipset.
Looking ahead, this same “code modernized” STAC-A2 benchmark source code will likely deliver interesting performance per device, performance per watt, and performance per rack when running on the next generation Intel® Xeon Phi™ processor family, code-named Knights Landing, because of its dual vector units per core and greater number of cores. Running self-hosted on these new Intel® Xeon Phi™ processors should further increase performance, as it eliminates offload-mode data transfers and any dependency on the PCIe bus. The -no-offload compiler switch is all that is required to turn off offload data transfers. No source code changes should be required.
Optimizing the STAC-A2 Benchmark
Performance improvements in the STAC-A2 benchmark suite by the Intel Software and Solutions group are the result of: (1) better vectorization coupled with exploiting the wider vector units on modern Intel® Xeon® processors and Intel® Xeon Phi™ co-processors, (2) improved multithreading, and (3) exploiting heterogeneity when running in a combined Intel® Xeon® processor and Intel® Xeon Phi™ processor runtime environment.
More precisely from a technical perspective, the current performance-leading benchmark implementation uses an Intel® TBB flow graph, Intel® TBB parallel algorithms and OpenMP 4.0 SIMD vectorization to make best use of two Intel® Xeon Phi™ co-processors 7120P plus a dual Intel® Xeon® processor E5-2697 v3 that were used to run the benchmark. Asynchronous support for the heterogeneous devices is implemented using Intel® TBB heterogeneous flow graph (shown in the block diagram below) to offload work to the Intel® Xeon Phi™ coprocessors as well as run on the Intel Xeon processor.
Dynamic load balancing between the CPU and coprocessors was implemented via a token-based system where tokens are issued based on resource availability – namely when a device is ready for more work. Intel Principal Engineer Robert Geva noted that it was important to use Intel® TBB to queue tasks as it reduced both threading and parallelization overheads.
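The token idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not Intel's implementation: it uses only the C++ standard library rather than the Intel® TBB flow graph, and the `Device` struct, its `chunks_per_time_unit` field, and the `dispatch_chunks` function are hypothetical names. Each device returns a token when it finishes a chunk, and the next chunk goes to whichever device returned its token first.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical device description: relative speed in chunks per time unit.
struct Device {
    double chunks_per_time_unit;
};

// Simulates token-based dispatch. A device hands back a token when it is
// ready for more work, and the dispatcher gives the next chunk to whichever
// device returned its token earliest. Returns the chunk count per device.
std::vector<std::size_t> dispatch_chunks(const std::vector<Device>& devices,
                                         std::size_t total_chunks) {
    using Token = std::pair<double, std::size_t>;  // (time ready, device index)
    std::priority_queue<Token, std::vector<Token>, std::greater<Token>> ready;
    for (std::size_t d = 0; d < devices.size(); ++d) {
        ready.push({0.0, d});  // every device starts with one token available
    }
    std::vector<std::size_t> assigned(devices.size(), 0);
    for (std::size_t c = 0; c < total_chunks && !ready.empty(); ++c) {
        const Token token = ready.top();
        ready.pop();
        const std::size_t d = token.second;
        ++assigned[d];
        // The device returns its token once this chunk is finished.
        ready.push({token.first + 1.0 / devices[d].chunks_per_time_unit, d});
    }
    return assigned;
}
```

In this model, a device that is four times faster ends up with roughly four times as many chunks without any static partitioning, which is the behavior a readiness-driven token scheme is meant to produce.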
The following table shows the various speedups achieved by Intel since the first highly optimized code base was released in 2013 and the impact on the baseline STAC-A2 benchmark using multiple threading and vectorization approaches as well as the performance of various CPU and heterogeneous Intel hardware configurations. Specifically:
- Parallelization: Reflects the progression from OpenMP to Intel® TBB.
- Vectorization: Shows the progression from hand-coded intrinsics to an Intel-only #simd pragma to the current OpenMP standard.
- Heterogeneity: Shows changes from the Intel-only offload pragmas to the OpenMP standard and then to Intel® TBB.
- Greeks time (warm): Performance increased over time through the use of more powerful Intel Architecture hardware including both processors and Intel® Xeon Phi™ co-processors and more advanced C++ programmability.
Qualified STAC members can download any or all of the binaries and run them on their servers to replicate the results.
| Intel Xeon E5 2697-V2 + Xeon Phi | Intel Xeon
| Intel Xeon E5 2699-V3 + Xeon Phi | Intel Xeon E5 2697-V3 + 2 x Xeon Phi |
| Month | Jun 2013 | Aug 2013 | May 2014 | Aug 2014 | Aug 2014 | Oct 2015 |
| STAC SUT ID | INTC130607a | INTC130829 | INTC140507 | INTC140814 | INTC140815 | INTC151028 |
| Greeks time, warm, in seconds (lower is better) | 7.1 | 4.8 | 0.63 | 0.74 | 0.53 | 0.22 |
Figure 3: Intel code modernization results on STAC-A2 from June 2013 to October 2015 (STAC results courtesy STAC) 
The computation requirements of the financial services industry for both regulatory compliance and alpha generation have grown significantly over the past few years. The stakes in keeping pace with new technology are significant as exemplified by the 32.9x speedup of the STAC-A2 benchmark results over the past two and a half years.
The STAC-A2 benchmark results clearly show that even if your code was highly optimized to run on recent hardware – like the 2013 version of the STAC-A2 code base – there are still opportunities to dramatically improve performance using current software techniques and modern Intel hardware. Simply purchasing new hardware is not a guarantee of better performance as utilizing wider vector instructions like Intel® AVX-512 and scaling multi-threading to utilize many more cores per processor and multiple vector units per core generally requires some form of software modification.
A number of global banks have acknowledged they are working with the Intel Software and Solutions group to modernize their production codes. Techniques include the use of vector and multi-core parallelism as well as heterogeneous operation across CPUs and Intel Xeon Phi co-processors. These same code modernization techniques should continue to pay dividends in the future as they will not only increase performance but also prepare applications to run efficiently on the next generation of Intel Xeon processors and Intel Xeon Phi processors. The STAC-A2 results demonstrate that the performance benefit today is significant, as modern code can run faster on current hardware while also consuming less power and rack space in the data center.
 Specifically the STAC-A2.β2.GREEKS.TIME.WARM benchmark, comparing the first public results in June 2013 to the latest results in October 2015. Both of these were Intel-based platforms. To see all public STAC-A2 results for Intel-based configurations, see www.STACresearch.com/intelA2
 The official STAC Report™ for each configuration is hyperlinked to the SUT ID in the table. “Greeks time (warm)” refers to the STAC-A2.β2.GREEKS.TIME.WARM benchmark, which is the average of warm runs of the Greeks calculations on an option with the baseline problem size defined by STAC-A2.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as STAC Benchmarks, SYSmark, and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
STAC and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center LLC.