Pages: pp. 793-798

Abstract—Genome assemblies are typically compared with respect to their contiguity, coverage, and accuracy. We propose a genome-wide, alignment-free genomic distance based on compressed maximal exact matches and suggest adding it to the benchmark of commonly used assembly quality metrics. Maximal exact matches are perfect repeats, without gaps or misspellings, which cannot be further extended to either their left- or right-end side without loss of similarity. The genomic distance here proposed is based on the normalized compression distance, an information-theoretic measure of the relative compressibility of two sequences estimated using multiple finite-context models. This measure exposes similarities between the sequences, as well as, the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. We use four human genome assemblies for illustration and discuss the impact of genome sequencing and assembly in the final content of maximal exact matches and the genomic distance here proposed.

Keywords—Genome sequencing and assembly; maximal exact matches; normalized compression distance

The importance of high-quality genome assemblies for biomedical research can hardly be overstated. As consequence of the reduction in sequencing costs and the increased throughput of sequencing technologies, enabling large-scale sequencing projects of many individuals in a population, or of many populations in an ecosystem, genome assembly is evermore an ubiquitous methodology. However, genome assembly remains one of the most difficult computational problems in genomics, which is particularly challenging for large and repeat-rich genomes, such as those of mammals.

There are numerous algorithms and software packages available for genome assembly and, likewise, a myriad of quality metrics, many of which were evaluated in the first Assemblathon competition [ ^{1}] and, more recently, in, for example, [ ^{2}], [ ^{3}]. Genome assemblies are typically compared based on their contiguity, coverage, and accuracy. The most commonly used metric to assess long-range contiguity is the N50 statistic, often complemented by other metrics such as contig and scaffold size. Coverage quantifies the amount of missing sequence in the draft assembly, using metrics such as the number of contigs and scaffolds, or the alignment to a reference genome sequence. The accuracy of an assembly can be measured in diverse ways, including the number of single nucleotide substitutions, insertion, and deletion structural errors, other large-scale structural errors, copy-number errors, and the assembly of specific annotated genomic regions such as genes and repeats.

Assembly quality metrics that rely on sequence comparison often require some measure of similarity or distance, i.e., a function $D(S_1,S_2)$ that assigns a real number to every pair of sequences $S_1$ and $S_2$ , such that the more similar $S_1$ is to $S_2$ , the smaller the value of $D(S_1,S_2)$ . Let ${\cal A}$ be a finite and ordered set that is called an *alphabet*. Its elements are called *characters* and its cardinality is $\vert {\cal A}\vert$ . A *string* over the alphabet ${\cal A}$ is a finite sequence of elements of ${\cal A}$ . Let ${\cal A}^ \ast$ be the set of all strings over ${\cal A}$ , equipped with the binary operation of concatenation, which is an associative operation. The *empty sequence*$\varepsilon$ is a neutral element for the operation of concatenation. Let $S$ be a string of length $\vert S\vert$ over ${\cal A}$ . A metric is a function $D: {\cal A}^\ast \times {\cal A}^\ast \rightarrow {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^+_0$ that satisfies the properties:

- $D(S_1,S_2) \ge 0$ ,
- $D(S_1,S_2)=0 \; {\rm iff} \; S_1=S_2$ ,
- $D(S_1,S_2)=D(S_2,S_1)$ ,
- $D(S_1,S_2)+D(S_2,S_3)\ge D(S_1,S_3)$ ,

$\forall S_1, S_2, S_3 \in {\cal A}^\ast$ . Examples of distances between pairs of sequences include the subword distance $D(S_1,S_2)=\vert S_1\vert \;{+}$$\vert S_2\vert -2u$ , where $\vert S_1\vert$ and $\vert S_2\vert$ are the lengths of sequences $S_1$ and $S_2$ , respectively, and $u$ is the length of the longest common subword of $S_1$ and $S_2$ [ ^{4}]; the prefix/suffix distance $D(S_1,S_2)=\vert S_1\vert +\vert S_2\vert -2v$ , where $\vert S_1\vert$ and $\vert S_2\vert$ are again the lengths of sequences $S_1$ and $S_2$ , respectively, and $v$ is the length of the longest common prefix/suffix of $S_1$ and $S_2$ [ ^{4}]; or the Levenshtein edit distance, defined as the minimum number of edits needed to transform sequence $S_1$ into sequence $S_2$ , with allowable edit operations being insertion, deletion, or substitution of a single character [ ^{5}]. Regardless of their widespread application in bioinformatics and computational biology, the computation of edit distances for large data sets is still computationally prohibitive. The fastest algorithm for the case of constant-size alphabet remains the algorithm of Masek and Paterson [ ^{6}] that runs in $O(n^2 / \log n)$ time. Moreover, edit distances are scarcely significant when applied to whole-genomes; hence, they are being replaced by global similarity measures that refer, implicitly or explicitly, to the subword composition of the sequences [ ^{7}]. Such distances are currently being explored for the challenge of population genomics [ ^{8}]. For problems of genome evolution, edit and permutation distances that operate on sequence fragments and impart genome rearrangements are key (e.g., [ ^{9}], [ ^{10}]). With these distances, not only the sequence content is evaluated, but also the order by which the different sequence fragments are observed. An interesting application of permutation distances pertains the problem of completing scaffolds from genome assemblies sequenced with high-throughput methodologies by comparison to a reference genome [ ^{11}], [ ^{12}]. Such a step is crucial in improving the quality of a genome assembly, but it does not quantify such improvement.

Because assembling repetitive regions remains the ultimate challenge, we propose a genome-wide, alignment-free genomic distance for assembly comparison based on the normalized compression distance of all unique maximal exact matches in a pair of sequences, and suggest adding it to the benchmark of commonly used assembly quality metrics. Maximal exact matches are perfect repeats, without gaps or misspellings, which cannot be further extended to either their left- or right-end side without loss of similarity. From a concept originally devised for string processing problems in the context of computer science, maximal exact matches have been applied to similar problems in the realm of computational biology, namely, for sequence analysis [ ^{13}]. They are key in massively parallel sequencing for seeding alignment of sequencing reads in genome assembly, and as anchor points in comparisons of closely related genomes [ ^{14}]. Some of the earlier applications of suffix trees and suffix arrays for finding maximal exact matches include the MUMmer algorithms [ ^{15}], [ ^{16}], [ ^{17}] and the REPuter software package [ ^{18}], whereas, recent additions include using sparse suffix arrays for efficiency [ ^{14}]. Other designations in the literature include maximal exact repeats [ ^{13}], maximal unique matching subsequences [ ^{15}], and perfect repeats [ ^{19}].

The normalized compression distance is part of a plethora of information-theoretic measures of the degree of similarity between two sequences based on their relative compressibility. The idea is that similar sequences are expected to contain a larger number of common subsequences, and therefore should be more compressible than distant sequences. The normalized compression distance is here estimated using multiple finite-context models. It exposes similarities between the sequences, as well as, the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. An important advantage of these information-based similarity measures is that they do not require a feature extraction step.

The genomic distance here proposed is not an evolutionary distance. As an alignment-free distance, it is not based on homology, hence, it is not easily convertible to an evolutionary distance [ ^{20}] (though recent efforts have attempted a link between both [ ^{21}]). The genomic distance presented is proposed for comparing genome assemblies and, as such, will tendentiously be smaller between genome assemblies of equivalent rather than diverging quality, an effect that may even superimpose a larger evolutionary distance.

Let $S[p]$ be the $p$ th character of string $S$ , with $1 \le p \le \vert S\vert$ . A substring of $S$ starting at position $p_1$ and ending at position $p_2$ is denoted by $S[p_1..p_2]$ , with $p_1 \le p_2$ . If $p_1 = p_2 = p$ , then $S[p..p] \equiv S[p]$ . Moreover, $lS$ ( $Sr$ ) denotes the concatenation of character $l$ ( $r$ ) to the left- (right-) end side of $S$ , with $l,r \in {\cal A}$ . For convenience, consider also two additional characters, $\#$ and $\$$ , that do not belong to the alphabet ${\cal A}$ . By definition, the character to the left of the first character of string $S$ is $\#$ , i.e., $S[0] = \#$ , while the character to the right of the last character of string $S$ is $\$$ , i.e., $S[\vert S\vert +1] = \$$ (this definition of $S[0]$ and $S[\vert S\vert +1]$ incurs in a few formal, though innocuous, inconsistencies, ignored here for the sake of simplicity).

A maximal repeated pair in $S$ is a pair of identical substrings ( $S[p_1..p_1+\vert \alpha \vert -1] = S[p_2..p_2+\vert \alpha \vert -1] = \alpha$ ) such that the character to the immediate left (right) of one of the substrings is different from the character to the immediate left (right) of the other substring ( $S[p_1-1] \ne S[p_2-1]$ and $S[p_1+\vert \alpha \vert ] \ne S[p_2+\vert \alpha \vert ]$ ). It is represented by a triple ( $p_1,p_2,\vert \alpha \vert$ ), where $p_1$ and $p_2$ are the starting positions of the two substrings, with $p_1 \ne p_2$ . A substring $\alpha$ is a *maximal exact match* of $S$ if there is at least a maximal repeated pair in $S$ of the form ( $p_1,p_2,\vert \alpha \vert$ ) [ ^{13}]. Hence, by definition, each maximal exact match occurs at least twice, and a long maximal exact match may contain smaller maximal exact matches within itself. These definitions have been previously put forward elsewhere [ ^{22}], [ ^{23}], [ ^{24}] but are here included again for the sake of self-containment.

We use four human genome assemblies as proof-of-concept. The human reference genome assembly GRCh37 build 37.1, available at the National Center for Biotechnology Information (NCBI) website, is used as the gold standard for the intraspecies comparisons detailed. This is an upgrade on the initial human reference genome sequenced with hierarchical shotgun capillary-based methodologies [ ^{25}], [ ^{26}], which has been carefully revised and corrected throughout the last decade (e.g., [ ^{27}], [ ^{28}]), thus significantly reducing the number of gaps from an initial value of ${\sim}150,000$ to ${\sim}250$ [ ^{29}]. We also use the May 2007 HuRef assembly of the genome of J. Craig Venter, sequenced with capillary-based whole-genome shotgun technologies using the Applied Biosystems 3730xl DNA analyzer and *de novo* assembled with the Celera Assembler [ ^{30}], which is also available at the NCBI website. We use the NA12878 assembly of DNA from the cell line GM12878, sequenced with massively parallel sequencing technologies using Illumina Genome Analyzers and assembled with the ALLPATHS-LG program [ ^{31}], whose unplaced scaffolds are available at the GenBank website [ ^{32}]. Finally, we use the YH assembly of the genome of a Han Chinese, also sequenced with massively parallel sequencing technologies using Illumina Genome Analyzers and assembled with the SOAPdenovo assembler [ ^{33}], whose unplaced scaffolds are available at the BGI-Shenzhen website [ ^{34}]. All chromosomes (GRCh37 and HuRef assemblies) or scaffolds (NA12878 and YH assemblies) are initially concatenated using a delimiting character that does not belong to the alphabet $\{{\rm A},{\rm C},{\rm G},{\rm T}\}$ , to avoid artificial matches across boundaries. Moreover, we ignore all sequence ambiguities by replacing any subsequence of nucleotides not in $\{{\rm A},{\rm C},{\rm G},{\rm T}\}$ with the same delimiting character. Because this delimiting character is processed as a string terminator, the order by which the chromosomes or scaffolds are inserted is irrelevant and does not affect the results.

To compute the genomic distance based on compressed maximal exact matches, we start by discovering all maximal exact matches in each genome assembly, using the algorithm detailed in Appendix A (see the supplemental material, which can be found on the Computer Society Digital Library at http://doi. ieeecomputersociety.org/10.1109/TCBB.2013.77). We discard information about the copy number, or multiplicity, of the maximal exact matches discovered, and refer to this set of single-copy words as all unique maximal exact matches.

An overall analysis of the distributions of all unique maximal exact matches is likely to mask important details, as the distributions are highly nonstationary. Hence, we divide the distributions in four ranges of maximal exact match lengths and define sets containing all maximal exact matches of length within the range. Set S1 contains all maximal exact matches of length 1 bp $\le \vert \alpha \vert < 10$ bp, set S10 contains all maximal exact matches of length 10 bp $\le \vert \alpha \vert < 100$ bp, set S100 contains all maximal exact matches of length 100 bp $\le \vert \alpha \vert < 1{,}000$ bp, and set S1,000 contains all maximal exact matches of length 1,000 bp $\le \vert \alpha \vert < 10{,}000$ bp, where unit bp stands for base pairs. For simplicity of notation, the S in the set name will be, whenever appropriate, replaced by the first letter of the genome assembly it refers to. For example, set G1 contains all maximal exact matches of length 1 to 9 bp in the GRCh37 assembly.

We compress these sets of maximal exact matches using the methodology of finite-context models described in Appendix D, available in the online supplemental material. We then assess their similarity by computing the normalized compression distance (NCD) in (3) of Appendix C, available in the online supplemental material, between each pair of sets of maximal exact matches.

Finally, we define the genomic distance based on compressed maximal exact matches between two genomic sequences as

$$\Omega (S_1,S_2) = {1\over 2\upsilon^2} \sum _{i=1}^{\upsilon^2} \Big( {\rm NCD}_i(S_1,S_2)+ {\rm NCD}_i(S_2,S_1)\Big),$$(1)

where parameter $\upsilon$ is the number of ranges of maximal exact match lengths, with a value of four for all genome assemblies in this study. It is possible to compare sequences that are divided into different number of sets, by correcting the denominator to the product of both numbers, instead of $\upsilon^2$ . The genomic distance based on compressed maximal exact matches is, therefore, an average normalized compression distance between sequences $S_1$ and $S_2$ estimated over all sets of maximal exact matches in each sequence. However, unlike the normalized compression distance, $\Omega$ is symmetric by construction, i.e., $\Omega (S_1,S_2)= \Omega (S_2,S_1)$ .

An implementation of this genomic distance is freely available for noncommercial use in a software package called “Quality of Assemblies by Repeat Compression” at http:// bioinformatics.ua.pt/software/quarc/.

Fig. 1 displays the distribution of all unique maximal exact matches as a function of the maximal exact match length in units of base pairs (bp) for each human genome assembly. For small lengths, the number of maximal exact matches increases exponentially with $4^k$ , where 4 is the alphabet cardinality and $k$ is the length of the maximal exact match, as all $4^k$ possible words occur and are maximal exact matches. Close to and beyond the inflexion point (at length 16 bp), the number of maximal exact matches becomes smaller than the number of $4^k$ possible words. The rate of decrease in the number of maximal exact matches beyond this inflexion points is initially exponential (until maximal exact repeat lengths of ${\sim} 20$ bp), followed by a subexponential rate in the right heavy-tails. Contrasting these distributions to those of random strings highlights the absence of heavy-tails in the latter, for all sequence lengths and alphabet cardinalities tested (Fig. 3 in Appendix B, available in the online supplemental material). In particular, random strings of length $10^9$ characters in a four-letter alphabet, which mimic the genome size and alphabet cardinality of the human genome, have maximal exact matches with a maximum length of ${\sim} 30$ bp. Hence, the presence of right heavy-tails is an exclusive property of genomic sequences.

Figure Fig. 1. Distribution of the number of unique maximal exact matches as a function of the maximal exact match length in units of base pairs (bp) in four human genome assemblies.

Table 1 summarizes the statistics of the distributions of maximal exact matches in each human genome assembly. *Size* refers to the size of the accessible genome, which is the number of unambiguous nucleotides (A,C,G or T) in units of base pairs (bp). The total *number* of unique maximal exact matches does not consider their copy number, or multiplicity. The columns *minLen*, *medLen,* and *maxLen* display, respectively, the minimum, median, and maximum length of the maximal exact matches in the genome assembly, in units of base pairs. The columns *minCN* and *maxCN* are the minimum and maximum copy number, respectively, in the genome assembly. For completeness, we looked into the variation in copy number as a function of the maximal exact match length and observed similar results for all four human genome assemblies. For very small and very large maximal exact matches, the copy number variance is small, with the former having consistently high copy number and the latter having very small copy numbers. In the middle of the distribution of maximal exact match lengths, the copy number variance is significant, implying that for a given length, some maximal exact matches will have small copy number, where others will have very large copy numbers. Results for random strings are similar in the ranges of small and large maximal exact match lengths, but the copy number variance of maximal exact matches of median length is much smaller in random strings than in genomic sequences.

Table 1. Summary Statistics of the Distributions of Maximal Exact Matches in Four Human Genome Assemblies

Because overall statistics may mask important details, particularly in nonstationary distributions, we investigated different sets of maximal exact matches separately (Section 2.3). Table 2 displays the number of unique maximal exact matches per set in each human genome assembly. The number of maximal exact matches in the smaller range of lengths (sets S1) is equal in all human genome assemblies. This similarity is increasingly lost in ranges of increasingly longer maximal exact matches.

Table 2. Number of Unique Maximal Exact Matches per Set in Four Human Genome Assemblies

The data in Fig. 1 and Tables 1 and 2 highlight well-known difficulties in the *de novo* assembly of large repetitive regions. The analysis of the total number of maximal exact matches, their length, and copy number, using the reference GRCh37 assembly as gold standard, shows a decreasing tendency as one moves from the capillary-based HuRef assembly, to the massively parallel NA12878 and YH assemblies. The explanation for this decrease lies with both the sequencing technology and the assembly algorithm. GRCh37 and HuRef were both sequenced with capillary-based methodologies, though using different approaches, but the former has been continuously revised and corrected, rendering it the most complete and accurate human genome sequence so far (HuRef covers ${\sim}97$ percent of GRCh37). NA12878 and YH were both sequenced with massively parallel methodologies, but the former presents a larger and richer content in maximal exact matches, which is in accordance with the claim by the authors that the quality of this NA12878 assembly (which only covers ${\sim}91$ percent of GRCh37) approaches that of capillary-based sequencing assemblies [ ^{31}]. A key factor for explaining the larger number of median and large maximal exact matches in the NA12878 assembly is that it benefited from a large fosmid library of variable-length inserts, including 40 kilobases (kb). Clearly, the assembly methodology also influences the final genome assembly, and ALLPATHS-LG fared better than SOAPdenovo in the Assemblathon competition [ ^{1}].

We compressed the sets of unique maximal exact matches using eight finite-context models with context depths $\{2, 4, 6, 8, 10, 12{,}$$14, 16 \}$ (Appendix D, available in the online supplemental material). Sets of maximal exact matches were sorted lexicographically before compression. Probabilities were estimated with $\delta =1/20$ in (4) in the case of larger contexts (14 and 16) and the default value of $\delta =1$ otherwise [ ^{35}]. Inverted repeats were considered by selecting the respective flag. Our experience in using multiple finite-context models for compressing long genomic sequences of repeat-rich genomes such as the human genome, points at this being a good combination of parameters for ensuring a good final compression.

Fig. 2 displays the normalized compression distance (3) between sets of maximal exact matches as colormaps. Smaller normalized compression distance values imply greater sequence similarity, whereas larger normalized compression distance values imply greater sequence dissimilarity or difference. As there is some dependence on the compressor used, the matrices displayed in Fig. 2 are not symmetrical (e.g., ${\rm NCD}({\rm G1},{\rm H1}) \ne {\rm NCD}({\rm H1},{\rm G1})$ ). Moreover, the values of the normalized compression distance are slightly larger than zero for comparisons of the same set (e.g., ${\rm NCD}({\rm G1},{\rm G1})>0$ ) due to some overheads of the coding scheme. Likewise, the values of the normalized compression distance may be slightly larger than one for very dissimilar sets.

Figure Fig. 2. Normalized compression distance between sets of maximal exact matches in four human genome assemblies.

The data in Fig. 2 highlight the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. For example, compare the shade of red in the squares corresponding to the comparison of sets G1 and G1000. Clearly, knowing the small maximal exact matches previously (light red square in the first row of the fourth column) improves the coding of the larger maximal exact matches (the converse scenario is represented by a dark red square in the fourth row of the first column). Fig. 2 also exposes similarities in maximal exact matches between genome assemblies, as exemplified by the light blue squares corresponding to the comparison of sets G1 and H1 (first row of the fifth column), sets G1 and N1 (first row of the ninth column), and sets G1 and Y1 (first row of the 13th column). If the normalized compression distance had not been assessed by length range but globally, all detail would have been lost.

The normalized compression distances ((3) and Fig. 2) were used to estimate the genomic distance for assembly comparison based on compressed maximal exact matches in (1), whose values are displayed in Table 3. The HuRef assembly has the second largest content in maximal exact matches ( Table 1) and the greatest sequence similarity to the reference GRCh37 assembly, followed by the NA12878 and YH assemblies, respectively. This ranking makes evident the desirable direct dependency of the genomic distance here proposed on the overall content in maximal exact matches of the genome assemblies being compared. Two key factors will impart this overall content. The first is organismal and genome evolution, as two individuals from the same species will share a larger number and content of genomic features than those from different species. The second factor is current limitations of sequencing technologies and assembly algorithms that hinder the uncovering of repetitive regions in the draft assemblies.

Table 3. The Genomic Distance Based on Compressed Maximal Exact Matches between Four Human Genome Assemblies

The genomic distance here proposed for assembly comparison may be used with data other than whole-genome assemblies. We use data made available by a recent critical assessment of genome assemblies and assembly algorithms as an example [ ^{2}]. The data consist of 10 different assemblies of the human chromosome 14 and we consider three ranges of maximal exact match lengths, namely, sets S1, S10, and S100. As before, we do not consider length ranges for which some sets of maximal exact matches would be empty. Table 4 displays the genomic distance for assembly comparison based on compressed maximal exact matches (1) resulting from the comparison of the reference GRCh37 assembly of this chromosome with other nine assemblies. In accordance with the findings reported in the aforementioned study, for which “ALLPATHS-LG and CABOG stood out as assemblers capable of producing both high contiguity and high accuracy” assemblies [ ^{2}], the genomic distance here proposed outputs CABOG and ALLPATHS-LG as the chromosome assemblies more similar to the reference GRCh37 assembly.

Table 4. The Genomic Distance Based on Compressed Maximal Exact Matches for Human Chromosome 14 between the Reference GRCh37 Assembly and Other Nine Assemblies

We have proposed a genomic distance for assembly comparison based on the normalized compression distance of all unique maximal exact matches in two genome sequences. The normalized compression distance measures the improvement in the compression of a given sequence by previously knowing another sequence through the concatenation of both strings. As discussed in Appendix C, available in the online supplemental material, the normalized compression distance and, by consequence, the genomic distance here proposed, depend on the compressor used. However, consistency in the use of an optimized compressor throughout all comparisons will minimize this potential caveat. The same rationale should be applied to the choice of compression models (Appendix D, available in the online supplemental material). Moreover, string operations other than the concatenation of both full-length strings, such as a random mixture of subsequences from both sequences, are also interesting to consider and will be explored in future work.

The genomic distance proposed is aimed at the comparison of genome assemblies. By providing a quantification of missing repeats in a genome assembly, as well as their identification, this measure is suited as a quality metric of both coverage and accuracy. Repeat analysis in the first Assemblathon competition [ ^{1}] was based on simulated reads. Though this is the appropriate setting for comparing assemblies, as the “correct” answer is known, it has the limitation of conveying no information on how the various assembly methods and algorithms fare with real, large, and repeat-rich data sets. Therefore, we suggest adding this genomic distance to the benchmark of commonly used assembly quality metrics [ ^{1}], to complement other quality metrics based on the analysis of specific genomic regions (e.g., [ ^{42}]). Studies that comprehensively assess genome assemblies often report the dispersion of quality metrics and lack of consistency of the evaluations [ ^{1}], [ ^{2}], [ ^{3}]. The rankings of assemblies here obtained are concordant with overall assembly evaluations; hence, this quality metric does not contradict other measures of contiguity.

Finally, though the rationale for the genomic distance here proposed depends on intraspecies data and the existence of a “gold standard” to which assemblies are compared, a high-quality assembly of an evolutionary-close organism may be used in the absence of such standard.

This work was supported by project grants FCOMP-01-0124-FEDER-010095, FCOMP-01-0124-FEDER-022682, and Incentivo/EEI/UI0127/2013 of the Portuguese Foundation for Science and Technology (Fundação para a Ci $\hat{\rm e}$ ncia e a Tecnologia), by the Operational Programme Thematic Factors of Competitiveness (Programa Operacional Factores de Competitividade, COMPETE), by the National Strategic Reference Framework (Quadro de Refer $\hat{\rm e}$ ncia Estratégico Nacional, QREN), and by the European Regional Development Fund (Fundo Europeu de Desenvolvimento Regional, FEDER). SS was funded by a grant from Project FCOMP-01-0124-FEDER-010095. SPG acknowledges funding from the European Social Fund and the Portuguese Ministry of Education and Science.

- 1. D. Earl, et al., “Assemblathon 1: A Competitive Assessment of De Novo Short Read Assembly Methods,”
*Genome Research,*vol. 21, pp. 2224-2241, 2011. - 2. S. Salzberg, et al., “GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms,”
*Genome Research,*vol. 22, pp. 557-567, 2012. - 3. K. Bradnam, et al., “Assemblathon 2: Evaluating De Novo Methods of Genome Assembly in Three Vertebrate Species,”
*Rapid Post,*http://arxiv.org/abs/1301.5406, 2013. - 4. M. Lothaire,
*Combinatorics on Words.*Cambridge Univ. Press, 1997. - 5. V.I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,”
*Soviet Physics Doklady,*vol. 10, pp. 707-710, 1966. - 6. M.J. Masek, and M. Paterson, “A Faster Algorithm for Computing String Edit Distances,”
*J. Computer and System Sciences,*vol. 20, pp. 18-31, 1980. - 7. A. Apostolico, “Maximal Words in Sequence Comparisons Based on Subword Composition,”
*Algorithms and Applications,*pp. 34-44, Springer-Verlag, 2010. - 8. B. Haubold, and P. Pfaffelhuber, “Alignment-Free Population Genomics: An Efficient Estimator of Sequence Diversity,”
*Genes, Genomes, Genetics,*vol. 2, pp. 883-889, 2012. - 9. J.-P.D.J.-S. Varré, and E. Rivals, “Transformation Distances: A Family of Dissimilarity Measures Based on Movements of Segments,”
*Bioinformatics,*vol. 15, pp. 194-202, 1999. - 10. D. Sankoff, “Genome Rearrangement with Gene Families,”
*Bioinformatics,*vol. 15, pp. 909-917, 1999. - 11. A. Muñoz, C. Zheng, Q. Zhu, V. Albert, S. Rounsley, and D. Sankoff, “Scaffold Filling, Contig Fusion and Comparative Gene Order Inference,”
*BMC Bioinformatics,*vol. 11, article 304, 2010. - 12. H. Jiang, C. Zheng, D. Sankoff, and B. Zhu, “Scaffold Filling under the Breakpoint and Related Distances,”
*IEEE Trans. Computational Biology and Bioinformatics,*vol. 9, no. 4, pp. 1220-1229, July 2012. - 13. D. Gusfield,
*Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.*Cambridge Univ. Press, 1997. - 14. Z. Khan, J.S. Bloom, L. Kruglyak, and M. Singh, “A Practical Algorithm for Finding Maximal Exact Matches in Large Sequence Data Sets Using Sparse Suffix Arrays,”
*Bioinformatics,*vol. 25, pp. 1609-1616, 2009. - 15. A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg, “Alignment of Whole Genomes,”
*Nucleic Acids Research,*vol. 27, pp. 2369-2376, 1999. - 16. A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg, “Fast Algorithms for Large-Scale Genome Alignment and Comparison,”
*Nucleic Acids Research,*vol. 30, pp. 2478-2483, 2002. - 17. S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, “Versatile and Open Software for Comparing Large Genomes,”
*Genome Biology,*vol. 5, article R12, 2004. - 18. S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, “REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale,”
*Nucleic Acids Research,*vol. 29, pp. 4633-4642, 2001. - 19. V. Becher, A. Deymonnaz, and P. Heiber, “Efficient Computation of All Perfect Repeats in Genomic Sequences of Up to Half a Gigabyte, with a Case Study on the Human Genome,”
*Bioinformatics,*vol. 25, pp. 1746-1753, 2009. - 20. M. Domazet-Lo o, and B. Haubold, “Efficient Estimation of Pairwise Distances between Genomes,”
*Bioinformatics,*vol. 25, pp. 3221-3227, 2009. - 21. B. Haubold, P. Pfaffelhuber, M. Domazet-Lo o, and T. Wiehe, “Estimating Mutation Distances from Unaligned Genomes,”
*J. Computational Biology,*vol. 16, pp. 1487-1500, 2009. - 22. A.J. Pinho, P.J.S.G. Ferreira, S.P. Garcia, and J.M.O.S. Rodrigues, “On Finding Minimal Absent Words,”
*BMC Bioinformatics,*vol. 10, article 137, 2009. - 23. S. Garcia, A. Pinho, J. Rodrigues, C. Bastos, and P. Ferreira, “Minimal Absent Words in Prokaryotic and Eukaryotic Genomes,”
*PLoS ONE,*vol. 6, no. 1,article e16065, 2011. - 24. S. Garcia, and A. Pinho, “Minimal Absent Words in Four Human Genome Assemblies,”
*PLoS ONE,*vol. 6, no. 12,article e29344, 2011. - 25. The Int'l Human Genome Sequencing Consortium, “Initial Sequencing and Analysis of the Human Genome,”
*Nature,*vol. 409, pp. 860-921, 2001. - 26. The Int'l Human Genome Mapping Consortium, “A Physical Map of the Human Genome,”
*Nature,*vol. 409, pp. 934-941, 2001. - 27. The Int'l Human Genome Sequencing Consortium, “Finishing the Euchromatic Sequence of the Human Genome,”
*Nature,*vol. 431, pp. 931-945, 2004. - 28. M. Garber, et al., “Closing Gaps in the Human Genome Using Sequencing by Synthesis,”
*Genome Biology,*vol. 10, p. R60, 2009. - 29. “Editorial: E Pluribus Unum,”
*Nature Methods,*vol. 7, p. 331, 2010. - 30. S. Levy, et al., “The Diploid Genome Sequence of an Individual Human,”
*PLoS Biology,*vol. 5, pp. 2113-2144, 2007. - 31. S. Gnerre, et al., “High-Quality Draft Assemblies of Mammalian Genomes from Massively Parallel Sequence Data,”
*Proc. Nat'l Academy of Sciences USA,*vol. 108, pp. 1513-1518, 2011. - 32. “GenBank Website,” ftp://ftp.ncbi.nih.gov/genbank/genomes/ Eukaryotes/vertebrates_mammals/Homo_sapiens/HsapALLPATHS1/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fa.gz, 2011.
- 33. R. Li, et al., “De Novo Assembly of Human Genomes with Massively Parallel Short Read Sequencing,”
*Genome Research,*vol. 20, pp. 265-272, 2010. - 34. “BGI-Shenzhen Website,” ftp://public.genomics.org.cn/BGI/yanhuang/Genomeassembly/asm_yanh.scafSeq.closure.gz, 2011.
- 35. A.J. Pinho, P.J.S.G. Ferreira, A.J.R. Neves, and C.A.C. Bastos, “On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models,”
*PLoS ONE,*vol. 6, article e21588, 2011. - 36. J. Miller, et al., “Aggressive Assembly of Pyrosequencing Reads with Mates,”
*Bioinformatics,*vol. 24, pp. 2818-2824, 2008. - 37. J. Simpson, et al., “ABySS: A Parallel Assembler for Short Read Sequence Data,”
*Genome Research,*vol. 19, pp. 1117-1123, 2009. - 38. “Genome Assembly Gold-Standard Evaluations,” http://gage.cbcb.umd. edu/recipes/msrca.html, 2013.
- 39. D. Zerbino, and E. Birney, “Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs,”
*Genome Research,*vol. 18, pp. 821-829, 2008. - 40. J. Simpson, and R. Durbin, “Efficient De Novo Assembly of Large Genomes Using Compressed Data Structures,”
*Genome Research,*vol. 22, pp. 549-556, 2012. - 41. S. Koren, T. Treangen, and M. Pop, “Bambus 2: Scaffolding Metagenomes,”
*Bioinformatics,*vol. 27, pp. 2964-2971, 2011. - 42. G. Parra, K. Bradnam, Z. Ning, T. Keane, and I. Korf, “Assessing the Gene Space in Draft Genomes,”
*Nunleic Acids Research,*vol. 37, pp. 289-297, 2009. - 43. M. Burrows, and D.J. Wheeler,
*A Block-Sorting Lossless Data Compression Algorithm.*Digital Systems Research Center, 1994. - 44. M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “The Enhanced Suffix Array and Its Applications to Genome Analysis,”
*Proc. Second Workshop Algorithms in Bioinformatics,*pp. 449-463, 2002. - 45. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park, “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications,”
*Proc. 12th Ann. Symp. Combinatorial Pattern Matching,*pp. 182-192, 2001. - 46. M.D. McIlroy, “Suffix Arrays,” http://www.cs.dartmouth.edu/~doug/sarray/, 2010.
- 47. U. Manber, and G. Myers, “Suffix Arrays: A New Method for On-Line String Searches,”
*SIAM J. Computing,*vol. 22, pp. 935-948, 1993. - 48. T.M. Cover, and J.A. Thomas,
*Elements of Information Theory.*Wiley-Interscience, 2006. - 49. C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W. Zurek, “Information Distance,”
*IEEE Trans. Information Theory,*vol. 44, no. 4, pp. 1407-1423, July 1998. - 50. R. Cilibrasi, and P.M.B. Vitányi, “Clustering by Compression,”
*IEEE Trans. Information Theory,*vol. 51, no. 4, pp. 1523-1545, Apr. 2005. - 51. T.C. Bell, J.G. Cleary, and I.H. Witten,
*Text Compression.*Prentice Hall, 1990. - 52. T. Bayes, “An Essay Towards Solving a Problem in the Doctrine of Chances,”
*Philosophical Trans. Royal Soc. London,*vol. 53, pp. 370-418, 1763. - 53. P.S. Laplace, “Mémoire Sur La Probabilité Des Causes Par Les Événements,”
*Mémoires De L'Académie Royale Des Sciences De Paris (Savants étrangers),*vol. 6, pp. 621-656, 1774, Reprinted in Oeuvres complètes de Laplace, Gauthier-Villars et fils, Paris, vol. 8, pp. 27-65, 1891. - 54. J. Venn,
*The Logic of Chance.*Macmillan and Co, 1888. - 55. G.F. Hardy, “Letter,”
*Insurance Record,*1889, Reprinted in*Trans. the Faculty of Actuaries,*vol. 8, pp. 180-181, 1920. - 56. E.T. Whittaker, “On Some Disputed Questions of Probability,”
*Trans. Faculty of Actuaries,*vol. 8, pp. 163-206, 1920. - 57. W.E. Johnson, “Probability: The Deductive and Inductive Problems,”
*Mind,*vol. 164, pp. 409-423, 1932. - 58. S.L. Zabell, “W.E. Johnson's Sufficientness Postulate,”
*Annals of Statistics,*vol. 10, pp. 1091-1099, 1982. - 59. S.L. Zabell, “The Rule of Succession,”
*Erkenntnis,*vol. 31, nos. 2/3, pp. 283-321, 1989. - 60. A.J. Pinho, D. Pratas, and P.J.S.G. Ferreira, “Bacteria DNA Sequence Compression Using a Mixture of Finite-Context Models,”
*Proc. IEEE Workshop Statistical Signal Processing (SSP '11),*pp. 125-128, June 2011.