Free Windows courses are not newsworthy by themselves, but those who wish to create Windows 10 apps for the Windows Marketplace and exploit the power of CUDA and OpenCL computing via C# should find the free Microsoft course, in combination with the TechEnablement tutorial "Combine C-Sharp With CUDA and OpenCL On Linux, iOS, Android and Windows", an enabling pair of … [Read more...]
Wonderful Teaching Video – The Zipf Mystery
My ten-year-old introduced me to the wonderful new video on YouTube, "The Zipf Mystery". It is 20 minutes well spent for anyone interested in information theory, computer science, computational drug design, social media, machine learning, and a huge number of other areas of research and application with real-world relevance. https://youtu.be/fCn8zs912OE Consider this a form of … [Read more...]
Free Online OpenACC Course Starting Oct. 1 2015
NVIDIA is providing a free, interactive, online OpenACC course starting on October 1, 2015. The course is made up of four instructor-led classes that include interactive lectures, hands-on exercises, and live office hours with the instructors. Register here! Classes start at 9 AM PT on Thursdays and "Office Hours" are at 9 AM PT on Tuesdays. All sessions will be recorded for … [Read more...]
Intel Xeon Phi Optimization Part 1 of 3: Multi-Threading and Parallel Reduction
This tutorial begins a 3-part series of educational publications on performance optimization in applications for Intel Xeon Phi coprocessors. In this publication, Ryo Asai (a Researcher at Colfax International) and Andrey Vladimirov (Head of HPC Research at Colfax International) will focus on some aspects of thread parallelism implementation in the OpenMP … [Read more...]
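As a rough illustration of the kind of OpenMP parallel reduction the series title refers to, here is a minimal sketch. It is not taken from the Colfax publication; the function name `parallel_sum` is hypothetical, and the code only shows the standard `reduction` clause on a summation loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example, not the Colfax code: sum a vector with an OpenMP
// reduction. Each thread accumulates into its own private copy of `total`,
// and OpenMP adds the private copies together when the loop finishes.
// Built without OpenMP support, the pragma is ignored and the loop runs serially.
double parallel_sum(const std::vector<double>& data) {
    double total = 0.0;
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Without the `reduction` clause, every thread would update the shared `total` at once, which is a data race. With it, the only cross-thread work is the final combination of per-thread partial sums.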
Port Some CUDA Codes To Intel Xeon Phi Simply and Efficiently
This tutorial shows that it is relatively easy to port many CUDA C/C++ source codes to OpenMP. In the past, such efforts were not generally considered worthwhile because of the large performance difference between multicore processors (that use OpenMP) and GPUs. The introduction of teraflop/s Intel Xeon Phi coprocessors eliminated that performance difference, which makes it much … [Read more...]
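To give a feel for what such a port can look like, here is a minimal sketch using SAXPY. It is an assumption for illustration, not code from the tutorial: the CUDA kernel in the comment and the function name `saxpy_openmp` are both hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CUDA original, shown only for comparison:
//   __global__ void saxpy(int n, float a, const float* x, float* y) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) y[i] = a * x[i] + y[i];
//   }
//
// OpenMP version: the per-thread index calculation and bounds check collapse
// into an ordinary loop, and the pragma asks OpenMP to split the iterations
// across threads. The loop body is the same expression as the kernel body.
void saxpy_openmp(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Kernels like this one, where each thread writes a distinct output element, tend to be the simplest to port. Kernels that depend on shared memory or block-level synchronization would need more thought than this sketch suggests.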
PGI Compiled OpenACC ILP Loop Beats CUDA-7 by 200 GF/s on Deep-learning PCA Example
The PGI OpenACC compiler beat the performance of a deep-learning-based PCA (Principal Components Analysis) example compiled with the NVIDIA CUDA 7.0 nvcc compiler by 200 GF/s on a K40c, using an ILP (Instruction Level Parallelism) loop structure taught in the TechEnablement classes and the forthcoming Farber OpenACC book. PCA is an important data analysis tool utilized by data scientists. Sign up for … [Read more...]
OpenCL SPIR Tutorial Teaches Portability Without Shipping Kernel Source
Intel has released an OpenCL tutorial showing how developers can use SPIR (Standard Portable Intermediate Representation) to preserve vendor and device portability without having to ship OpenCL kernel source code. For more information about how SPIR enables commercial OpenCL applications, see our article, "Commercial OpenCL! SPIR 2.0 Protects IP Yet Allows Powerful, Portable, … [Read more...]
Tutorial on the OpenCL 2.0 Generic Address Space
Adam Lake and Robert Ioffe posted a nice tutorial on the Intel website about the new OpenCL 2.0 generic address space. The OpenCL 2.0 generic address space makes writing OpenCL programs easier by removing the requirement to decorate every pointer with the address space it points to. Instead, OpenCL programmers just use pointers as they would in standard C. Utilizing this new … [Read more...]
Intel Posts OpenCL 2.0 QuickSort Tutorial (Compare to TE CUDA Version)
Intel engineer Robert Ioffe has posted an OpenCL QuickSort tutorial that utilizes nested parallelism and work-group scan functions. In particular, the tutorial shows how to use the OpenCL™ 2.0 enqueue_kernel functions that queue kernels from the device without host intervention (much like dynamic parallelism) plus work_group_scan_exclusive_add and … [Read more...]
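For readers unfamiliar with exclusive scans, here is a serial sketch of the result that work_group_scan_exclusive_add produces across a work-group: each position receives the sum of all values before it. This is a host-side reference for the semantics only, not the Intel tutorial's code, and the function name `exclusive_scan_add` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for an exclusive add-scan: out[i] is the sum of in[0..i-1],
// so out[0] is always 0. In OpenCL 2.0 the work-items of a work-group compute
// this cooperatively; here a single loop is used to show the result only.
std::vector<int> exclusive_scan_add(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;  // sum of all elements seen so far
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}
```

For the input {3, 1, 4, 1, 5} this returns {0, 3, 4, 8, 9}. In a parallel partition step, scanning a 0/1 flag per element this way is a common method for giving each selected element its output offset.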
Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors
Andrey Vladimirov at Colfax International has posted source code and a paper, "Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices", on the Colfax site. Andrey notes, "Benchmarks show that the discussed optimizations improve the application performance on the coprocessor by a factor of 2.8 compared to the unoptimized … [Read more...]