We show here that under several accepted measures of deviation from expected frequency, the candidates over- or underrepresented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the ?(n 2 ) possible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global linear detectors of favored and unfavored words for our probabilistic framework, and display the results of some preliminary that apply our constructions to the analysis of genomic sequences.
Citation:
Alberto Apostolico, Stefano Lonardi, Mary Ellen Bock, "Linear Global Detectors of Redundant and Rare Substrings," dcc, pp.168, Data Compression Conference (DCC '99), 1999