Issue No.01 - Jan.-Feb. (2013 vol.10)
pp: 200-206
Kristian Ovaska, Genome-Scale Biol. & Inst. of Biomed., Univ. of Helsinki, Helsinki, Finland
Lauri Lyly, Genome-Scale Biol. & Inst. of Biomed., Univ. of Helsinki, Helsinki, Finland
Biswajyoti Sahu, Inst. of Biomed., Univ. of Helsinki, Helsinki, Finland
Olli A. Janne, Inst. of Biomed., Physiol., Biomedicum, Univ. of Helsinki, Helsinki, Finland
Sampsa Hautaniemi, Genome-Scale Biol. & Inst. of Biomed., Univ. of Helsinki, Helsinki, Finland
Computational analysis of data produced in deep sequencing (DS) experiments is challenging due to large data volumes and requirements for flexible analysis approaches. Here, we present a mathematical formalism based on set algebra for frequently performed operations in DS data analysis to facilitate translation of biomedical research questions to language amenable for computational analysis. With the help of this formalism, we implemented the Genomic Region Operation Kit (GROK), which supports various DS-related operations such as preprocessing, filtering, file conversion, and sample comparison. GROK provides high-level interfaces for R, Python, Lua, and command line, as well as an extension C++ API. It supports major genomic file formats and allows storing custom genomic regions in efficient data structures such as red-black trees and SQL databases. To demonstrate the utility of GROK, we have characterized the roles of two major transcription factors (TFs) in prostate cancer using data from 10 DS experiments. GROK is freely available with a user guide from
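To illustrate what a set-algebra view of genomic regions can look like, the following is a minimal Python sketch and not GROK's actual API. It assumes regions are plain `(chromosome, start, end)` tuples with half-open coordinates; the function names `normalize`, `union`, `intersection`, and `difference`, as well as the example coordinates, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Assumption: a region is (chromosome, start, end) with half-open [start, end).
Region = Tuple[str, int, int]


def normalize(regions: Iterable[Region]) -> List[Region]:
    """Sort regions and merge overlapping or adjacent ones per chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def union(a: Iterable[Region], b: Iterable[Region]) -> List[Region]:
    """Positions covered by a or b."""
    return normalize(list(a) + list(b))


def intersection(a: Iterable[Region], b: Iterable[Region]) -> List[Region]:
    """Positions covered by both a and b (linear sweep over sorted lists)."""
    a, b = normalize(a), normalize(b)
    out: List[Region] = []
    i = j = 0
    while i < len(a) and j < len(b):
        ca, sa, ea = a[i]
        cb, sb, eb = b[j]
        if ca == cb:
            lo, hi = max(sa, sb), min(ea, eb)
            if lo < hi:
                out.append((ca, lo, hi))
        # Advance whichever region ends first in (chromosome, end) order.
        if (ca, ea) <= (cb, eb):
            i += 1
        else:
            j += 1
    return out


def difference(a: Iterable[Region], b: Iterable[Region]) -> List[Region]:
    """Positions covered by a but not by b (simple quadratic scan)."""
    b = normalize(b)
    out: List[Region] = []
    for chrom, start, end in normalize(a):
        cursor = start
        for cb, sb, eb in b:
            if cb != chrom or eb <= cursor or sb >= end:
                continue  # no overlap with the uncovered remainder
            if sb > cursor:
                out.append((chrom, cursor, sb))
            cursor = max(cursor, eb)
        if cursor < end:
            out.append((chrom, cursor, end))
    return out


# Hypothetical peak sets for two transcription factors.
tf_a = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
tf_b = [("chr1", 250, 400), ("chr2", 10, 20)]

shared = intersection(tf_a, tf_b)    # sites bound by both factors
only_a = difference(tf_a, tf_b)      # sites bound by the first factor only
either = union(tf_a, tf_b)           # sites bound by at least one factor
```

In this sketch a question such as "which sites are bound by one factor but not the other" becomes a single `difference` call. A production tool would likely use an indexed structure rather than the quadratic scan in `difference`.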
Bioinformatics, Genomics, Databases, Benchmark testing, Algebra, Software, Complexity theory, deep sequencing, genomic data analysis, region set algebra
Kristian Ovaska, Lauri Lyly, Biswajyoti Sahu, Olli A. Janne, Sampsa Hautaniemi, "Genomic Region Operation Kit for Flexible Processing of Deep Sequencing Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 1, pp. 200-206, Jan.-Feb. 2013, doi:10.1109/TCBB.2012.170