Breadth-First Graph Search Uses 2D Domain Decomposition – 400 GTEPS on 4096 GPUs

August 9, 2014 by Rob Farber

Parallel breadth-first search (BFS) is a standard benchmark and the basis of many other graph algorithms. The challenge lies in partitioning the graph across the nodes of a cluster while avoiding load imbalance and communication delays. The authors of the paper “Parallel Breadth First Search on the Kepler Architecture” utilize an interesting 2D decomposition of the graph adjacency matrix. Tests on R-MAT graphs show large-graph performance ranging from 1.1 GTEPS on a single K20 to 396 GTEPS using 4096 GPUs. The tests also compare performance against the method of Beamer (10 GTEPS on a single SMP device and 240 GTEPS on 115k cores).
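For reference, a GTEPS figure (giga-traversed edges per second) is simply the number of edges the search traverses divided by the wall-clock time, scaled to billions. Here is a minimal sketch of the computation in C++, using illustrative numbers rather than measurements from the paper:

    #include <cstdio>

    int main() {
        // GTEPS = edges traversed / (elapsed seconds * 1e9).
        // Both values below are hypothetical, for illustration only.
        double edges_traversed = 4.0e9;  // edges touched by one BFS run
        double elapsed_seconds = 2.0;    // wall-clock time for that run
        double gteps = edges_traversed / (elapsed_seconds * 1e9);
        std::printf("%.2f GTEPS\n", gteps);  // prints "2.00 GTEPS"
        return 0;
    }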

Read the paper here.

The Distributed BFS Problem in a Nutshell

In a distributed-memory breadth-first search implementation, the graph is partitioned by assigning a subset of the vertex and edge sets to each node. The search is then performed in parallel, starting from the processor that owns the root vertex. At each step, processors holding one or more frontier vertices follow the edges incident to those vertices to identify unvisited neighbors; the newly visited vertices form the next frontier. The search stops when the connected component containing the root vertex has been completely visited. The performance challenge lies in the partitioning strategy used to distribute the graph: a worst-case choice can leave a single thread doing all the work in an entire cluster, while a good choice keeps every thread busy and minimizes communication overhead.
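To make the frontier-expansion step concrete, here is a minimal single-node, level-synchronous BFS sketch in C++. It illustrates only the frontier/next-frontier logic described above; it is not the paper’s distributed CUDA implementation, and the adjacency-list representation is an assumption made for illustration.

    #include <cstdio>
    #include <vector>

    // Level-synchronous BFS: expand the current frontier one level at a time.
    // The paper partitions vertices and edges across nodes and performs this
    // step in parallel; this sketch runs the same logic on a single node.
    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int root) {
        std::vector<int> level(adj.size(), -1);      // -1 marks unvisited vertices
        std::vector<int> frontier{root};
        level[root] = 0;
        for (int depth = 1; !frontier.empty(); ++depth) {
            std::vector<int> next;                   // next frontier under construction
            for (int u : frontier)
                for (int v : adj[u])                 // follow edges incident to the frontier
                    if (level[v] == -1) {            // unvisited neighbor found
                        level[v] = depth;
                        next.push_back(v);
                    }
            frontier.swap(next);                     // advance to the next level
        }
        return level;                                // search ends when the frontier empties
    }

    int main() {
        // Tiny example graph: edges 0-1, 0-2, 1-3 (undirected).
        std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0}, {1}};
        std::vector<int> level = bfs_levels(adj, 0);
        for (size_t v = 0; v < level.size(); ++v)
            std::printf("vertex %zu: level %d\n", v, level[v]);
        return 0;
    }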

The 2D decomposition was performed with a modification of the method of Yoo et al. in the SC05 conference proceedings paper, “A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L”.
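The appeal of a 2D decomposition is that it tiles the |V| × |V| adjacency matrix over an R × C grid of processors, so each processor owns a block of edges rather than a set of whole vertices, and frontier exchanges involve only a processor’s grid row and column rather than all nodes. Below is a minimal sketch of the edge-to-owner mapping under a simple blocked layout; this layout is an assumption made for illustration, not necessarily the exact mapping used in the paper.

    #include <cstdio>

    // 2D decomposition: tile the |V| x |V| adjacency matrix over an R x C
    // processor grid. Edge (u, v) lands in the block covering row u and
    // column v. Simple blocked layout, assumed here for illustration.
    struct Owner { int grid_row; int grid_col; };

    Owner owner_of_edge(long u, long v, long num_vertices, int R, int C) {
        long rows_per_block = (num_vertices + R - 1) / R;  // ceil(|V| / R)
        long cols_per_block = (num_vertices + C - 1) / C;  // ceil(|V| / C)
        return Owner{static_cast<int>(u / rows_per_block),
                     static_cast<int>(v / cols_per_block)};
    }

    int main() {
        long num_vertices = 1L << 20;   // hypothetical graph size
        int R = 4, C = 4;               // hypothetical 4 x 4 processor grid
        Owner o = owner_of_edge(12345, 987654, num_vertices, R, C);
        std::printf("edge owner: grid (%d, %d)\n", o.grid_row, o.grid_col);
        return 0;
    }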

For more information on betweenness centrality graph algorithms, don’t miss our article “SC14 – Fast Hybrid GPU Betweenness Centrality Code Achieves Nearly Ideal Scaling to 192 GPUs”.

Filed Under: CUDA, News Tagged With: CUDA, GPU, NVIDIA
