hc-OTU: A Fast and Accurate Method for Clustering Operational Taxonomic Unit based on Homopolymer Compaction
Seunghyun Park is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea, and is also with the School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea (email: firstname.lastname@example.org).
To assess the genetic diversity of an environmental sample in metagenomics studies, the amplicon sequences of 16S rRNA genes need to be clustered into operational taxonomic units (OTUs). Many existing tools for OTU clustering trade off accuracy against computational efficiency. We propose a novel OTU clustering algorithm, hc-OTU, which achieves high accuracy and fast runtime by exploiting homopolymer compaction and k-mer profiling to significantly reduce the time needed to compute pairwise distances between amplicon sequences. We comprehensively compare the proposed method with other widely used methods, including UCLUST, CD-HIT, MOTHUR, ESPRIT, ESPRIT-TREE, and CLUSTOM, using nine different experimental datasets and many evaluation metrics, such as normalized mutual information, adjusted Rand index, measure of concordance, and F-score. Our evaluation reveals that the proposed method achieves accuracy comparable to that of MOTHUR and ESPRIT-TREE, two widely used OTU clustering methods, with orders-of-magnitude speed-up.
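The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the hc-OTU implementation: the function names are hypothetical, compaction is assumed to mean collapsing each run of identical bases to a single base, and the k-mer distance shown is one simple choice made for illustration, since the abstract does not define either step.

```python
from collections import Counter
from itertools import groupby


def compact_homopolymers(seq: str) -> str:
    """Collapse each run of identical bases to one base (assumed definition)."""
    return "".join(base for base, _ in groupby(seq))


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Illustrative distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    This formula is an assumption for the sketch, not the paper's metric.
    """
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # multiset intersection of the profiles
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared / denom if denom > 0 else 1.0


# Two reads that differ only in homopolymer run lengths.
read_a = compact_homopolymers("ACGGGTTA")
read_b = compact_homopolymers("ACGTTTTA")
dist = kmer_distance(read_a, read_b)
```

Pyrosequencing reads are prone to errors in homopolymer length, so reads that differ only in run lengths become identical after compaction. Comparing k-mer profiles is also far cheaper than aligning every pair of sequences, which is the kind of saving the abstract attributes to the method.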
16S rRNA, clustering algorithm, operational taxonomic unit (OTU), pyrosequencing, metagenomics
S. Park, H. Choi, B. Lee, J. Chun, J. Won and S. Yoon, "hc-OTU: A Fast and Accurate Method for Clustering Operational Taxonomic Unit based on Homopolymer Compaction," in IEEE/ACM Transactions on Computational Biology and Bioinformatics.