There have been huge advances in DNA sequencing technologies in recent years; for example, Illumina has just announced the HiSeq X system, which can sequence human genomes at a cost of only $1,000 per genome. Besides population-scale human genome sequencing, another important application of sequencing technologies is environmental sequencing (so-called metagenomics). Metagenomic studies usually involve sequencing a sample of cells collected from their environment (such as soil, marine water, or the human gut). Analyzing metagenomic sequencing data is computationally challenging because of the diverse mix of sequenced genomes and the large data size.
CRiSPy (Computing Richness in Pyrosequencing Datasets) is a software tool developed by PhD student Nguyen Thuy Diem from Nanyang Technological University (Singapore) and Prof. Bertil Schmidt from the University of Mainz (Germany) for estimating the genomic richness and variation within a microbial sequencing dataset. More specifically, CRiSPy is a de novo (i.e. reference-database-independent) clustering technique that performs taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). The processing pipeline for computing an OTU clustering is based on hierarchical clustering and is shown in Figure 1. CRiSPy is freely available at https://github.com/ngthuydiem/crispy.
CRiSPy-CUDA aims at producing high-quality clustering results at a reasonably low computational cost. In particular, we tackle two computational issues of the OTU hierarchical clustering approach:
- High computational complexity (i.e. the compute-intensive sequence alignment operation for building a pair-wise distance matrix)
- High memory complexity (i.e. the quadratic memory requirement of the hierarchical clustering procedure)
The first issue is addressed by using CUDA-enabled GPUs to parallelize the two-stage pairwise distance computation: k-mer distances are computed first as a computationally cheap filter, and the more accurate genetic distances (based on dynamic programming) are then computed only for pairs with a low k-mer distance. Taking full advantage of fast CUDA shared memory, we achieve speedups of around 100x for the genetic distance computation and around 50x for the k-mer distance computation on a GeForce GTX 580 compared to single-core CPU code. A simplified sketch of the k-mer filter stage is shown below.
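To make the two-stage idea concrete, here is a minimal CUDA C++ sketch of the k-mer filter stage, in which one GPU thread scores one read pair from precomputed k-mer count vectors using the commonly used formulation d(x,y) = 1 − Σw min(cx[w], cy[w]) / (min(|x|,|y|) − K + 1). The kernel layout, names and flat count-vector format are illustrative assumptions, not CRiSPy's actual implementation (which additionally tiles data through shared memory):

```cpp
// Hedged sketch of the k-mer filter stage (compile with nvcc).
#include <cuda_runtime.h>
#include <cstdio>
#include <string>
#include <vector>

constexpr int K = 6;                    // k-mer length (assumed)
constexpr int NUM_KMERS = 1 << (2 * K); // 4^K possible DNA k-mers

// One thread per read pair: d(x,y) = 1 - sum_w min(cx[w], cy[w]) / (min(|x|,|y|) - K + 1)
__global__ void kmerDistanceKernel(const unsigned short* counts, const int* lengths,
                                   const int2* pairs, float* dist, int numPairs)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;
    const unsigned short* a = counts + (size_t)pairs[p].x * NUM_KMERS;
    const unsigned short* b = counts + (size_t)pairs[p].y * NUM_KMERS;
    int common = 0;                                   // number of shared k-mers
    for (int w = 0; w < NUM_KMERS; ++w) common += min((int)a[w], (int)b[w]);
    int minLen = min(lengths[pairs[p].x], lengths[pairs[p].y]);
    dist[p] = 1.0f - (float)common / (float)(minLen - K + 1);
}

// Host-side k-mer counting for one read (A/C/G/T only, for brevity).
static void countKmers(const std::string& s, unsigned short* c)
{
    auto code = [](char ch) { return ch == 'A' ? 0 : ch == 'C' ? 1 : ch == 'G' ? 2 : 3; };
    for (size_t i = 0; i + K <= s.size(); ++i) {
        int w = 0;
        for (int j = 0; j < K; ++j) w = (w << 2) | code(s[i + j]);
        ++c[w];
    }
}

int main()
{
    std::vector<std::string> reads = { "ACGTACGTACGTGGCC", "ACGTACGTACGAGGCC" };
    int n = (int)reads.size();
    std::vector<unsigned short> counts((size_t)n * NUM_KMERS, 0);
    std::vector<int> lengths(n);
    for (int i = 0; i < n; ++i) {
        countKmers(reads[i], counts.data() + (size_t)i * NUM_KMERS);
        lengths[i] = (int)reads[i].size();
    }
    int2 pair; pair.x = 0; pair.y = 1;                // the single pair to score

    unsigned short* dCounts; int* dLen; int2* dPairs; float* dDist;
    cudaMalloc(&dCounts, counts.size() * sizeof(unsigned short));
    cudaMalloc(&dLen, n * sizeof(int));
    cudaMalloc(&dPairs, sizeof(int2));
    cudaMalloc(&dDist, sizeof(float));
    cudaMemcpy(dCounts, counts.data(), counts.size() * sizeof(unsigned short), cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, lengths.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dPairs, &pair, sizeof(int2), cudaMemcpyHostToDevice);

    kmerDistanceKernel<<<1, 32>>>(dCounts, dLen, dPairs, dDist, 1);
    float d;
    cudaMemcpy(&d, dDist, sizeof(float), cudaMemcpyDeviceToHost);
    printf("k-mer distance = %.3f\n", d);
    cudaFree(dCounts); cudaFree(dLen); cudaFree(dPairs); cudaFree(dDist);
    return 0;
}
```

In the real pipeline, only pairs whose k-mer distance falls below a threshold are passed on to the much more expensive dynamic-programming alignment stage.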
The second issue is addressed by the design of a memory-efficient sparse-matrix hierarchical clustering algorithm. Even though we use a sparse representation of the calculated distance matrices, large matrices can still exceed the RAM capacity of a standard workstation. We have therefore developed a fast and memory-efficient hierarchical clustering algorithm for sparse matrices called SparseHC. SparseHC scans a sorted sparse distance matrix chunk by chunk and builds a dendrogram by merging a cluster pair as soon as its distance is determined to be the smallest among all remaining cluster pairs. The main insight is that it is unnecessary to wait for all cluster-pair distances to be fully computed before finding the pair with the smallest distance: partial information yields lower bounds on cluster-pair distances, and these bounds are used for the comparison (a minimal sketch of this idea follows Table 1). In summary, SparseHC is an online hierarchical clustering tool that performs accurate single-, complete- and average-linkage hierarchical clustering with linear empirical memory complexity, making it particularly useful for clustering large datasets on computers with limited memory resources (see Table 1). SparseHC is available at https://bitbucket.com/ngthuydiem/sparsehc and is described in more detail at http://www.sciencedirect.com/science/article/pii/S1877050914001781.
| Number of data points | Matrix size (GB) | Memory usage: single (MB) | Memory usage: complete (MB) | Memory usage: average (MB) | Memory efficiency: single | Memory efficiency: complete | Memory efficiency: average |
|---|---|---|---|---|---|---|---|
| 50,000 | 14 | 7.0 | 30.5 | 44.9 | 2,055 | 469 | 318 |
| 100,000 | 56 | 12.2 | 60.0 | 90.2 | 4,673 | 954 | 635 |
| 150,000 | 126 | 17.7 | 89.6 | 143.4 | 7,272 | 1,437 | 897 |
| 200,000 | 224 | 22.9 | 119.4 | 198.9 | 10,013 | 1,917 | 1,151 |
Table 1: Memory efficiency of SparseHC, expressed as the ratio of the full distance matrix size to the actual memory usage for single-, complete- and average-linkage clustering.
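The following is a minimal sketch (not SparseHC's actual code) of the early-merging idea for the complete-linkage case. Edges of the sparse distance matrix arrive sorted by increasing distance, and a cluster pair is merged as soon as all cross edges between the two clusters have been seen: at that moment its complete-linkage distance equals the current edge's distance, while every other pair still has an unseen (and therefore larger) edge, so none of them can be closer. Only per-cluster-pair edge counts are stored, never the full matrix. The toy input and names are illustrative; compile with -std=c++17:

```cpp
// Hedged sketch of online complete-linkage clustering from sorted sparse edges.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Edge { int i, j; float d; };

int main()
{
    // Toy sorted input over 4 points (a full matrix here; in general only
    // pairs below a distance cutoff would be streamed in).
    std::vector<Edge> edges = {
        {0,1,0.10f}, {2,3,0.20f}, {0,2,0.50f}, {1,2,0.60f}, {0,3,0.70f}, {1,3,0.80f}
    };
    int n = 4, nextId = n;
    std::vector<int> parent(2 * n);                    // union-find over cluster ids
    std::vector<long long> size(2 * n, 1);
    for (int i = 0; i < 2 * n; ++i) parent[i] = i;
    auto find = [&](int x) { while (parent[x] != x) x = parent[x] = parent[parent[x]]; return x; };

    // seen[c][x] = number of cross edges already read between clusters c and x
    std::vector<std::unordered_map<int, long long>> seen(2 * n);

    for (const Edge& e : edges) {
        int a = find(e.i), b = find(e.j);
        if (a == b) continue;
        long long cnt = ++seen[a][b];
        seen[b][a] = cnt;
        if (cnt == size[a] * size[b]) {                // all cross edges seen -> merge now
            int c = nextId++;
            printf("merge clusters %d and %d at height %.2f -> cluster %d\n", a, b, e.d, c);
            parent[a] = parent[b] = c;
            size[c] = size[a] + size[b];
            seen[a].erase(b); seen[b].erase(a);
            for (int src : {a, b})                     // combine the edge counts of a and b
                for (auto& [x, k] : seen[src]) {
                    seen[c][x] += k;
                    seen[x].erase(src);
                    seen[x][c] = seen[c][x];
                }
            seen[a].clear(); seen[b].clear();
        }
    }
    return 0;
}
```

Single linkage is even simpler (a pair can be merged on its first cross edge), while average linkage additionally tracks partial distance sums to obtain the lower bounds mentioned above.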
Another important feature of CRiSPy is a dynamic method that automatically determines the distance threshold using single change-point detection in order to obtain the natural grouping of a dataset. Thanks to this dynamic dendrogram cutting, CRiSPy produces better grouping results than most existing OTU binning tools, including QIIME, CD-HIT-OTU, UPARSE, ESPRIT-Tree and ESPRIT, in terms of the adjusted Rand index (ARI): a random labeling in which the predicted grouping is very different from the actual grouping yields an ARI close to 0.0, while a perfect labeling scores 1.0 (a small worked ARI example follows Table 2). Table 2 shows the results of a benchmark test.
| | CRiSPy | M-Pick | ESPRIT | ESPRIT-Tree | QIIME | UPARSE | CD-HIT-OTU |
|---|---|---|---|---|---|---|---|
| ARI score | 0.86 | 0.42 | 0.57 | 0.53 | 0.57 | 0.57 | 0.55 |
| Runtime (s) | 4 | 61 | 1,956 | 31 | 3 | 33 | 2 |
| Memory usage (MB) | 47 | 2,330 | 76 | 252 | 370 | 11 | 50 |
Table 2: Benchmark of OTU clustering tools on 10 randomly generated datasets, each with 5,000 16S rRNA reads from 100 species; means over the 10 runs are reported. M-Pick and CRiSPy are dynamic grouping tools; ESPRIT and ESPRIT-Tree are static hierarchical clustering tools; QIIME (pick_otus v1.8), UPARSE (in USEARCH v7.0) and CD-HIT-OTU (v0.0.2) are static greedy heuristic clustering tools. Experiments were conducted on a 64-bit Linux OS using a PC with a quad-core Intel Xeon W3540 CPU and 8 GB RAM; CRiSPy runs on a GeForce GTX 580 installed in the same PC.
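For reference, the ARI used as the quality measure in Table 2 can be computed from the contingency table between the true species labels and a predicted OTU assignment. The sketch below shows the standard formula on a made-up eight-read toy labeling (the data and names are purely illustrative):

```cpp
// Hedged sketch: adjusted Rand index (ARI) between two labelings.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

static double choose2(double n) { return n * (n - 1.0) / 2.0; }

double adjustedRandIndex(const std::vector<int>& truth, const std::vector<int>& pred)
{
    std::map<std::pair<int, int>, long long> cont;   // contingency table
    std::map<int, long long> rows, cols;             // cluster sizes per labeling
    for (size_t i = 0; i < truth.size(); ++i) {
        ++cont[{truth[i], pred[i]}];
        ++rows[truth[i]];
        ++cols[pred[i]];
    }
    double index = 0, rowSum = 0, colSum = 0;
    for (auto& kv : cont) index  += choose2((double)kv.second);
    for (auto& kv : rows) rowSum += choose2((double)kv.second);
    for (auto& kv : cols) colSum += choose2((double)kv.second);
    double expected = rowSum * colSum / choose2((double)truth.size());
    double maxIndex = 0.5 * (rowSum + colSum);
    return (index - expected) / (maxIndex - expected);
}

int main()
{
    // Ground-truth species vs. a predicted OTU assignment for 8 reads:
    // one read of species 0 is wrongly binned with species 1.
    std::vector<int> truth = {0,0,0,0,1,1,1,1};
    std::vector<int> pred  = {0,0,0,1,1,1,1,1};
    printf("ARI = %.3f\n", adjustedRandIndex(truth, pred));  // ~0.49 for this toy case
    return 0;
}
```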
Thanks to its memory and computational efficiency, CRiSPy produces high-quality clusters and scales to several hundred thousand reads on our workstation with only 8 GB of RAM. Although it targets 16S rRNA short reads for OTU binning of metagenomic datasets, the underlying alignment and clustering techniques could potentially be applied to other types of sequence data (such as expressed sequence tags or protein sequences) or to other hierarchical tree construction problems (such as phylogenetic tree or guide tree construction). Furthermore, SparseHC can be a useful tool for a variety of big-data application areas that need to perform memory-efficient hierarchical clustering from a (possibly sparse) distance matrix using single-, complete- or average-linkage schemes.
Bertil Schmidt is a tenured Full Professor and Chair for Parallel and Distributed Architectures at the University of Mainz, Germany. Prior to that, he was a faculty member at Nanyang Technological University (Singapore) and at the University of New South Wales (UNSW). His research group has designed a variety of algorithms and tools for bioinformatics, mainly focusing on the analysis of large-scale datasets. For his research work, he has received a CUDA Research Centre award, a CUDA Academic Partnership award, a CUDA Professor Partnership award and the Best Paper Award at IEEE ASAP 2009. Furthermore, he serves as the champion for Bioinformatics and Computational Biology on gpucomputing.net.