Breadth-First Graph Search Uses 2D Domain Decomposition – 400 GTEPS on 4096 GPUs

August 9, 2014 by Rob Farber

Parallel breadth-first search (BFS) is a standard benchmark and the basis of many other graph algorithms. The challenge lies in partitioning the graph across the nodes of a cluster while avoiding load imbalance and communication delays. The authors of the paper “Parallel Breadth First Search on the Kepler Architecture” utilize an interesting 2D decomposition of the graph adjacency matrix. Tests on R-MAT graphs show large-graph performance ranging from 1.1 GTEPS on a single K20 to 396 GTEPS using 4096 GPUs. The tests also compare performance against the method of Beamer (10 GTEPS on a single SMP device and 240 GTEPS on 115k cores).
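For reference, a GTEPS figure (giga-traversed edges per second) is simply the number of edges the search traverses divided by the wall-clock time, scaled to billions. Here is a minimal sketch of the computation in C++, using illustrative numbers rather than measurements from the paper:

    #include <cstdio>

    int main() {
        // GTEPS = edges traversed / (elapsed seconds * 1e9).
        // Both values below are hypothetical, for illustration only.
        double edges_traversed = 4.0e9;  // edges touched by one BFS run
        double elapsed_seconds = 2.0;    // wall-clock time for that run
        double gteps = edges_traversed / (elapsed_seconds * 1e9);
        std::printf("%.2f GTEPS\n", gteps);  // prints "2.00 GTEPS"
        return 0;
    }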

Read the paper here.

The Distributed BFS Problem in a Nutshell

In a distributed-memory breadth-first search implementation, the graph is partitioned by assigning a subset of the vertex and edge sets to each node. The search is then performed in parallel, starting from the processor that owns the root vertex. At each step, processors holding one or more frontier vertices follow the edges incident to those vertices to identify unvisited neighbors; the newly visited vertices form the next frontier. The search stops when the connected component containing the root vertex has been completely visited. The performance challenge lies in the partitioning strategy used to distribute the graph: a worst-case choice can leave a single thread doing all the work in an entire cluster, while a good choice keeps every thread busy and minimizes communication overhead.
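To make the frontier-expansion step concrete, here is a minimal single-node, level-synchronous BFS sketch in C++. It illustrates only the frontier/next-frontier logic described above; it is not the paper’s distributed CUDA implementation, and the adjacency-list representation is an assumption made for illustration.

    #include <cstdio>
    #include <vector>

    // Level-synchronous BFS: expand the current frontier one level at a time.
    // The paper partitions vertices and edges across nodes and performs this
    // step in parallel; this sketch runs the same logic on a single node.
    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int root) {
        std::vector<int> level(adj.size(), -1);      // -1 marks unvisited vertices
        std::vector<int> frontier{root};
        level[root] = 0;
        for (int depth = 1; !frontier.empty(); ++depth) {
            std::vector<int> next;                   // next frontier under construction
            for (int u : frontier)
                for (int v : adj[u])                 // follow edges incident to the frontier
                    if (level[v] == -1) {            // unvisited neighbor found
                        level[v] = depth;
                        next.push_back(v);
                    }
            frontier.swap(next);                     // advance to the next level
        }
        return level;                                // search ends when the frontier empties
    }

    int main() {
        // Tiny example graph: edges 0-1, 0-2, 1-3 (undirected).
        std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0}, {1}};
        std::vector<int> level = bfs_levels(adj, 0);
        for (size_t v = 0; v < level.size(); ++v)
            std::printf("vertex %zu: level %d\n", v, level[v]);
        return 0;
    }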

The 2D decomposition was performed with a modification of the method of Yoo et al. in the SC05 conference proceedings paper, “A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L”.
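The appeal of a 2D decomposition is that it tiles the |V| × |V| adjacency matrix over an R × C grid of processors, so each processor owns a block of edges rather than a set of whole vertices, and frontier exchanges involve only a processor’s grid row and column rather than all nodes. Below is a minimal sketch of the edge-to-owner mapping under a simple blocked layout; this layout is an assumption made for illustration, not necessarily the exact mapping used in the paper.

    #include <cstdio>

    // 2D decomposition: tile the |V| x |V| adjacency matrix over an R x C
    // processor grid. Edge (u, v) lands in the block covering row u and
    // column v. Simple blocked layout, assumed here for illustration.
    struct Owner { int grid_row; int grid_col; };

    Owner owner_of_edge(long u, long v, long num_vertices, int R, int C) {
        long rows_per_block = (num_vertices + R - 1) / R;  // ceil(|V| / R)
        long cols_per_block = (num_vertices + C - 1) / C;  // ceil(|V| / C)
        return Owner{static_cast<int>(u / rows_per_block),
                     static_cast<int>(v / cols_per_block)};
    }

    int main() {
        long num_vertices = 1L << 20;   // hypothetical graph size
        int R = 4, C = 4;               // hypothetical 4 x 4 processor grid
        Owner o = owner_of_edge(12345, 987654, num_vertices, R, C);
        std::printf("edge owner: grid (%d, %d)\n", o.grid_row, o.grid_col);
        return 0;
    }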

For more information on betweenness centrality graph algorithms, don’t miss our article “SC14 – Fast Hybrid GPU Betweenness Centrality Code Achieves Nearly Ideal Scaling to 192 GPUs”.

Filed Under: CUDA, News Tagged With: CUDA, GPU, NVIDIA
