2012 IEEE 12th International Conference on Data Mining Workshops (2012)
Brussels, Belgium
Dec. 10, 2012
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDMW.2012.57
Tandem repeats in DNA sequences are highly relevant to biological phenomena and diagnostic tools. Computational programs that discover these tandem repeats generate a huge volume of data, which is often difficult to interpret without further organization. In this paper, we describe a new method for post-processing tandem repeats through clustering. Our work presents multiple ways of representing tandem repeats using the n-gram model, combined with different clustering distance measures. Analysis of these clusters for chromosome 1 of the human genome shows that clustering tandem repeats according to 3-grams yields well-defined clusters. Our new, alignment-free method facilitates the analysis of the myriad tandem repeats that occur in the human genome, and we believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats.
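As a rough illustration of the alignment-free idea described in the abstract, the sketch below turns each repeat into a vector of 3-gram (trinucleotide) frequencies and compares two such vectors. It is not the authors' implementation: the abstract does not name the distance measures used, so cosine distance is an assumption here, and the names `trigram_profile`, `cosine_distance`, and the example sequences are hypothetical.

```python
from itertools import product
from math import sqrt

# All 64 possible trinucleotides over the DNA alphabet, in a fixed order.
TRIGRAMS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {gram: i for i, gram in enumerate(TRIGRAMS)}


def trigram_profile(seq):
    """Return a 64-dimensional vector of relative 3-gram frequencies."""
    seq = seq.upper()
    counts = [0] * len(TRIGRAMS)
    for i in range(len(seq) - 2):
        gram = seq[i:i + 3]
        if gram in INDEX:  # skip windows containing N or other symbols
            counts[INDEX[gram]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def cosine_distance(u, v):
    """Cosine distance (assumed measure, not confirmed by the source)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical repeats: two CAG-like repeats and one AT repeat.
repeat_a = "CAGCAGCAGCAGCAG"
repeat_b = "CAGCAGCAACAGCAG"
repeat_c = "ATATATATATATATAT"

d_ab = cosine_distance(trigram_profile(repeat_a), trigram_profile(repeat_b))
d_ac = cosine_distance(trigram_profile(repeat_a), trigram_profile(repeat_c))
```

Because no alignment is computed, each comparison costs time linear in the sequence lengths. In this toy case the two CAG-like repeats share trinucleotides and end up closer to each other than either is to the AT repeat, which shares none; pairwise distances of this kind could then be passed to any standard clustering routine.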
Genomics, Clustering algorithms, DNA, Biological cells, Humans, Algorithm design and analysis, classification, tandem repeats, n-grams, clustering, human genome
Yupu Liang, Dina Sokol, Sarah Zelikovitz, "Clustering Tandem Repeats via Trinucleotides", 2012 IEEE 12th International Conference on Data Mining Workshops, pp. 64-71, 2012, doi:10.1109/ICDMW.2012.57