There have been huge advances in DNA sequencing technologies in recent years; for example, Illumina has just announced the HiSeq X system, which can sequence human genomes at a cost of only $1,000 per genome. Besides population-scale human genome sequencing, another important application of sequencing technologies is environmental sequencing (so-called metagenomics). Metagenomic studies usually involve sequencing a sample of cells collected from their environment (such as soil, marine water, or the human gut). Analyzing metagenomic sequencing data is computationally challenging because of the diverse mix of sequenced genomes and the large data size.
CRiSPy (Computing Richness in Pyrosequencing Datasets) is a software tool developed by PhD student Nguyen Thuy Diem from Nanyang Technological University (Singapore) and Prof. Bertil Schmidt from the University of Mainz (Germany) for estimating the genomic richness and variation within a microbial sequencing dataset. More specifically, CRiSPy is a de novo (i.e. reference-database-independent) clustering technique that performs taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). The processing pipeline for computing an OTU clustering is based on hierarchical clustering and is shown in Figure 1. CRiSPy is freely available at https://github.com/ngthuydiem/crispy.
CRiSPy-CUDA aims at producing high-quality clustering results at a reasonably low computational cost. In particular, we tackle two computational issues of the OTU hierarchical clustering approach:
- High computational complexity (i.e. the compute-intensive sequence alignment operation for building a pair-wise distance matrix)
- High memory complexity (i.e. the quadratic memory requirement of the hierarchical clustering procedure)
The first issue is addressed by using CUDA-enabled GPUs to parallelize the two-stage pairwise distance computation: k-mer distances are computed first as a computationally cheap filter, and the more accurate genetic distances (based on dynamic programming) are then computed only for pairs with a low k-mer distance. Taking full advantage of fast CUDA shared memory, we achieve speedups of around 100x for the genetic distance computation and around 50x for the k-mer distance computation on a GeForce GTX 580 compared to single-core CPU code. A simplified sketch of the k-mer filter stage is shown below.
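To make the two-stage idea concrete, here is a minimal CUDA C++ sketch of the k-mer filter stage, in which one GPU thread scores one read pair from precomputed k-mer count vectors using the commonly used formulation d(x,y) = 1 − Σw min(cx[w], cy[w]) / (min(|x|,|y|) − K + 1). The kernel layout, names and flat count-vector format are illustrative assumptions, not CRiSPy's actual implementation (which additionally tiles data through shared memory):

```cpp
// Hedged sketch of the k-mer filter stage (compile with nvcc).
#include <cuda_runtime.h>
#include <cstdio>
#include <string>
#include <vector>

constexpr int K = 6;                    // k-mer length (assumed)
constexpr int NUM_KMERS = 1 << (2 * K); // 4^K possible DNA k-mers

// One thread per read pair: d(x,y) = 1 - sum_w min(cx[w], cy[w]) / (min(|x|,|y|) - K + 1)
__global__ void kmerDistanceKernel(const unsigned short* counts, const int* lengths,
                                   const int2* pairs, float* dist, int numPairs)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;
    const unsigned short* a = counts + (size_t)pairs[p].x * NUM_KMERS;
    const unsigned short* b = counts + (size_t)pairs[p].y * NUM_KMERS;
    int common = 0;                                   // number of shared k-mers
    for (int w = 0; w < NUM_KMERS; ++w) common += min((int)a[w], (int)b[w]);
    int minLen = min(lengths[pairs[p].x], lengths[pairs[p].y]);
    dist[p] = 1.0f - (float)common / (float)(minLen - K + 1);
}

// Host-side k-mer counting for one read (A/C/G/T only, for brevity).
static void countKmers(const std::string& s, unsigned short* c)
{
    auto code = [](char ch) { return ch == 'A' ? 0 : ch == 'C' ? 1 : ch == 'G' ? 2 : 3; };
    for (size_t i = 0; i + K <= s.size(); ++i) {
        int w = 0;
        for (int j = 0; j < K; ++j) w = (w << 2) | code(s[i + j]);
        ++c[w];
    }
}

int main()
{
    std::vector<std::string> reads = { "ACGTACGTACGTGGCC", "ACGTACGTACGAGGCC" };
    int n = (int)reads.size();
    std::vector<unsigned short> counts((size_t)n * NUM_KMERS, 0);
    std::vector<int> lengths(n);
    for (int i = 0; i < n; ++i) {
        countKmers(reads[i], counts.data() + (size_t)i * NUM_KMERS);
        lengths[i] = (int)reads[i].size();
    }
    int2 pair; pair.x = 0; pair.y = 1;                // the single pair to score

    unsigned short* dCounts; int* dLen; int2* dPairs; float* dDist;
    cudaMalloc(&dCounts, counts.size() * sizeof(unsigned short));
    cudaMalloc(&dLen, n * sizeof(int));
    cudaMalloc(&dPairs, sizeof(int2));
    cudaMalloc(&dDist, sizeof(float));
    cudaMemcpy(dCounts, counts.data(), counts.size() * sizeof(unsigned short), cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, lengths.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dPairs, &pair, sizeof(int2), cudaMemcpyHostToDevice);

    kmerDistanceKernel<<<1, 32>>>(dCounts, dLen, dPairs, dDist, 1);
    float d;
    cudaMemcpy(&d, dDist, sizeof(float), cudaMemcpyDeviceToHost);
    printf("k-mer distance = %.3f\n", d);
    cudaFree(dCounts); cudaFree(dLen); cudaFree(dPairs); cudaFree(dDist);
    return 0;
}
```

In the real pipeline, only pairs whose k-mer distance falls below a threshold are passed on to the much more expensive dynamic-programming alignment stage.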
The second issue is addressed by the design of a memory-efficient sparse-matrix hierarchical clustering algorithm. Even though we use a sparse representation of the calculated distance matrices, large matrices can still exceed the RAM capacity of a standard workstation. We have therefore developed a fast and memory-efficient hierarchical clustering algorithm for sparse matrices called SparseHC. SparseHC scans a sorted sparse distance matrix chunk by chunk and builds a dendrogram by merging a cluster pair as soon as its distance is determined to be the smallest among all remaining cluster pairs. The main insight is that it is unnecessary to wait for all cluster-pair distances to be fully computed before finding the pair with the smallest distance: partial information yields lower bounds on cluster-pair distances, and these bounds are used for the comparison (a minimal sketch of this idea follows Table 1). In summary, SparseHC is an online hierarchical clustering tool that performs accurate single-, complete- and average-linkage hierarchical clustering with linear empirical memory complexity, making it particularly useful for clustering large datasets on computers with limited memory resources (see Table 1). SparseHC is available at https://bitbucket.com/ngthuydiem/sparsehc and is described in more detail at http://www.sciencedirect.com/science/article/pii/S1877050914001781.
| Number of data points | Matrix size (GB) | Memory usage: single (MB) | Memory usage: complete (MB) | Memory usage: average (MB) | Memory efficiency: single | Memory efficiency: complete | Memory efficiency: average |
|---|---|---|---|---|---|---|---|
| 50,000 | 14 | 7.0 | 30.5 | 44.9 | 2,055 | 469 | 318 |
| 100,000 | 56 | 12.2 | 60.0 | 90.2 | 4,673 | 954 | 635 |
| 150,000 | 126 | 17.7 | 89.6 | 143.4 | 7,272 | 1,437 | 897 |
| 200,000 | 224 | 22.9 | 119.4 | 198.9 | 10,013 | 1,917 | 1,151 |
Table 1: Memory efficiency of SparseHC, expressed as the ratio of the full distance matrix size to the actual memory usage for single-, complete- and average-linkage clustering.
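The following is a minimal sketch (not SparseHC's actual code) of the early-merging idea for the complete-linkage case. Edges of the sparse distance matrix arrive sorted by increasing distance, and a cluster pair is merged as soon as all cross edges between the two clusters have been seen: at that moment its complete-linkage distance equals the current edge's distance, while every other pair still has an unseen (and therefore larger) edge, so none of them can be closer. Only per-cluster-pair edge counts are stored, never the full matrix. The toy input and names are illustrative; compile with -std=c++17:

```cpp
// Hedged sketch of online complete-linkage clustering from sorted sparse edges.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Edge { int i, j; float d; };

int main()
{
    // Toy sorted input over 4 points (a full matrix here; in general only
    // pairs below a distance cutoff would be streamed in).
    std::vector<Edge> edges = {
        {0,1,0.10f}, {2,3,0.20f}, {0,2,0.50f}, {1,2,0.60f}, {0,3,0.70f}, {1,3,0.80f}
    };
    int n = 4, nextId = n;
    std::vector<int> parent(2 * n);                    // union-find over cluster ids
    std::vector<long long> size(2 * n, 1);
    for (int i = 0; i < 2 * n; ++i) parent[i] = i;
    auto find = [&](int x) { while (parent[x] != x) x = parent[x] = parent[parent[x]]; return x; };

    // seen[c][x] = number of cross edges already read between clusters c and x
    std::vector<std::unordered_map<int, long long>> seen(2 * n);

    for (const Edge& e : edges) {
        int a = find(e.i), b = find(e.j);
        if (a == b) continue;
        long long cnt = ++seen[a][b];
        seen[b][a] = cnt;
        if (cnt == size[a] * size[b]) {                // all cross edges seen -> merge now
            int c = nextId++;
            printf("merge clusters %d and %d at height %.2f -> cluster %d\n", a, b, e.d, c);
            parent[a] = parent[b] = c;
            size[c] = size[a] + size[b];
            seen[a].erase(b); seen[b].erase(a);
            for (int src : {a, b})                     // combine the edge counts of a and b
                for (auto& [x, k] : seen[src]) {
                    seen[c][x] += k;
                    seen[x].erase(src);
                    seen[x][c] = seen[c][x];
                }
            seen[a].clear(); seen[b].clear();
        }
    }
    return 0;
}
```

Single linkage is even simpler (a pair can be merged on its first cross edge), while average linkage additionally tracks partial distance sums to obtain the lower bounds mentioned above.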
Another important feature of CRiSPy is a dynamic method that automatically determines the distance threshold using single change-point detection in order to obtain the natural grouping of a dataset. Thanks to this dynamic dendrogram cutting, CRiSPy produces better grouping results than most existing OTU binning tools, including QIIME, CD-HIT-OTU, UPARSE, ESPRIT-Tree and ESPRIT, in terms of the adjusted Rand index (ARI): a random labeling in which the predicted grouping is very different from the actual grouping yields an ARI close to 0.0, while a perfect labeling scores 1.0 (a small worked ARI example follows Table 2). Table 2 shows the results of a benchmark test.
| | CRiSPy | M-Pick | ESPRIT | ESPRIT-Tree | QIIME | UPARSE | CD-HIT-OTU |
|---|---|---|---|---|---|---|---|
| ARI score | 0.86 | 0.42 | 0.57 | 0.53 | 0.57 | 0.57 | 0.55 |
| Runtime (s) | 4 | 61 | 1,956 | 31 | 3 | 33 | 2 |
| Memory usage (MB) | 47 | 2,330 | 76 | 252 | 370 | 11 | 50 |
Table 2: Benchmark of OTU clustering tools on 10 randomly generated datasets, each with 5,000 16S rRNA reads from 100 species; means over the 10 runs are reported. M-Pick and CRiSPy are dynamic grouping tools; ESPRIT and ESPRIT-Tree are static hierarchical clustering tools; QIIME (pick_otus v1.8), UPARSE (in USEARCH v7.0) and CD-HIT-OTU (v0.0.2) are static greedy heuristic clustering tools. Experiments were conducted on a 64-bit Linux OS using a PC with a quad-core Intel Xeon W3540 CPU and 8 GB RAM; CRiSPy runs on a GeForce GTX 580 installed in the same PC.
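For reference, the ARI used as the quality measure in Table 2 can be computed from the contingency table between the true species labels and a predicted OTU assignment. The sketch below shows the standard formula on a made-up eight-read toy labeling (the data and names are purely illustrative):

```cpp
// Hedged sketch: adjusted Rand index (ARI) between two labelings.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

static double choose2(double n) { return n * (n - 1.0) / 2.0; }

double adjustedRandIndex(const std::vector<int>& truth, const std::vector<int>& pred)
{
    std::map<std::pair<int, int>, long long> cont;   // contingency table
    std::map<int, long long> rows, cols;             // cluster sizes per labeling
    for (size_t i = 0; i < truth.size(); ++i) {
        ++cont[{truth[i], pred[i]}];
        ++rows[truth[i]];
        ++cols[pred[i]];
    }
    double index = 0, rowSum = 0, colSum = 0;
    for (auto& kv : cont) index  += choose2((double)kv.second);
    for (auto& kv : rows) rowSum += choose2((double)kv.second);
    for (auto& kv : cols) colSum += choose2((double)kv.second);
    double expected = rowSum * colSum / choose2((double)truth.size());
    double maxIndex = 0.5 * (rowSum + colSum);
    return (index - expected) / (maxIndex - expected);
}

int main()
{
    // Ground-truth species vs. a predicted OTU assignment for 8 reads:
    // one read of species 0 is wrongly binned with species 1.
    std::vector<int> truth = {0,0,0,0,1,1,1,1};
    std::vector<int> pred  = {0,0,0,1,1,1,1,1};
    printf("ARI = %.3f\n", adjustedRandIndex(truth, pred));  // ~0.49 for this toy case
    return 0;
}
```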
Thanks to its memory and computational efficiency, CRiSPy produces high-quality clusters and scales to several hundred thousand reads on our workstation with only 8 GB of RAM. Although it targets 16S rRNA short reads for OTU binning of metagenomic datasets, the underlying alignment and clustering techniques could potentially be applied to other types of sequence data (such as expressed sequence tags or protein sequences) or to other hierarchical tree construction problems (such as phylogenetic tree or guide tree construction). Furthermore, SparseHC can be a useful tool for a variety of big-data application areas that need to perform memory-efficient hierarchical clustering from a (possibly sparse) distance matrix using single-, complete- or average-linkage schemes.
Bertil Schmidt is a tenured Full Professor and Chair for Parallel and Distributed Architectures at the University of Mainz, Germany. Prior to that, he was a faculty member at Nanyang Technological University (Singapore) and at the University of New South Wales (UNSW). His research group has designed a variety of algorithms and tools for bioinformatics, mainly focusing on the analysis of large-scale datasets. For his research work, he has received a CUDA Research Centre award, a CUDA Academic Partnership award, a CUDA Professor Partnership award and the Best Paper Award at IEEE ASAP 2009. Furthermore, he serves as the champion for Bioinformatics and Computational Biology on gpucomputing.net.