Abstract—Genome assemblies are typically compared with respect to their contiguity, coverage, and accuracy. We propose a genome-wide, alignment-free genomic distance based on compressed maximal exact matches and suggest adding it to the benchmark of commonly used assembly quality metrics. Maximal exact matches are perfect repeats, without gaps or misspellings, which cannot be further extended to either their left- or right-end side without loss of similarity. The genomic distance here proposed is based on the normalized compression distance, an information-theoretic measure of the relative compressibility of two sequences estimated using multiple finite-context models. This measure exposes similarities between the sequences, as well as, the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. We use four human genome assemblies for illustration and discuss the impact of genome sequencing and assembly in the final content of maximal exact matches and the genomic distance here proposed.
The importance of high-quality genome assemblies for biomedical research can hardly be overstated. As consequence of the reduction in sequencing costs and the increased throughput of sequencing technologies, enabling large-scale sequencing projects of many individuals in a population, or of many populations in an ecosystem, genome assembly is evermore an ubiquitous methodology. However, genome assembly remains one of the most difficult computational problems in genomics, which is particularly challenging for large and repeat-rich genomes, such as those of mammals.
There are numerous algorithms and software packages available for genome assembly and, likewise, a myriad of quality metrics, many of which were evaluated in the first Assemblathon competition [
^{1} ] and, more recently, in, for example, [
^{2} ], [
^{3} ]. Genome assemblies are typically compared based on their contiguity, coverage, and accuracy. The most commonly used metric to assess long-range contiguity is the N50 statistic, often complemented by other metrics such as contig and scaffold size. Coverage quantifies the amount of missing sequence in the draft assembly, using metrics such as the number of contigs and scaffolds, or the alignment to a reference genome sequence. The accuracy of an assembly can be measured in diverse ways, including the number of single nucleotide substitutions, insertion, and deletion structural errors, other large-scale structural errors, copy-number errors, and the assembly of specific annotated genomic regions such as genes and repeats.
Assembly quality metrics that rely on sequence comparison often require some measure of similarity or distance, i.e., a function
that assigns a real number to every pair of sequences
and
, such that the more similar
is to
, the smaller the value of
. Let
be a finite and ordered set that is called an
alphabet. Its elements are called
characters and its cardinality is
. A
string over the alphabet
is a finite sequence of elements of
. Let
be the set of all strings over
, equipped with the binary operation of concatenation, which is an associative operation. The
empty sequence is a neutral element for the operation of concatenation. Let
be a string of length
over
. A metric is a function
that satisfies the properties:
. Examples of distances between pairs of sequences include the subword distance
, where
and
are the lengths of sequences
and
, respectively, and
is the length of the longest common subword of
and
[
^{4} ]; the prefix/suffix distance
, where
and
are again the lengths of sequences
and
, respectively, and
is the length of the longest common prefix/suffix of
and
[
^{4} ]; or the Levenshtein edit distance, defined as the minimum number of edits needed to transform sequence
into sequence
, with allowable edit operations being insertion, deletion, or substitution of a single character [
^{5} ]. Regardless of their widespread application in bioinformatics and computational biology, the computation of edit distances for large data sets is still computationally prohibitive. The fastest algorithm for the case of constant-size alphabet remains the algorithm of Masek and Paterson [
^{6} ] that runs in
time. Moreover, edit distances are scarcely significant when applied to whole-genomes; hence, they are being replaced by global similarity measures that refer, implicitly or explicitly, to the subword composition of the sequences [
^{7} ]. Such distances are currently being explored for the challenge of population genomics [
^{8} ]. For problems of genome evolution, edit and permutation distances that operate on sequence fragments and impart genome rearrangements are key (e.g., [
^{9} ], [
^{10} ]). With these distances, not only the sequence content is evaluated, but also the order by which the different sequence fragments are observed. An interesting application of permutation distances pertains the problem of completing scaffolds from genome assemblies sequenced with high-throughput methodologies by comparison to a reference genome [
^{11} ], [
^{12} ]. Such a step is crucial in improving the quality of a genome assembly, but it does not quantify such improvement.
Because assembling repetitive regions remains the ultimate challenge, we propose a genome-wide, alignment-free genomic distance for assembly comparison based on the normalized compression distance of all unique maximal exact matches in a pair of sequences, and suggest adding it to the benchmark of commonly used assembly quality metrics. Maximal exact matches are perfect repeats, without gaps or misspellings, which cannot be further extended to either their left- or right-end side without loss of similarity. From a concept originally devised for string processing problems in the context of computer science, maximal exact matches have been applied to similar problems in the realm of computational biology, namely, for sequence analysis [
^{13} ]. They are key in massively parallel sequencing for seeding alignment of sequencing reads in genome assembly, and as anchor points in comparisons of closely related genomes [
^{14} ]. Some of the earlier applications of suffix trees and suffix arrays for finding maximal exact matches include the MUMmer algorithms [
^{15} ], [
^{16} ], [
^{17} ] and the REPuter software package [
^{18} ], whereas, recent additions include using sparse suffix arrays for efficiency [
^{14} ]. Other designations in the literature include maximal exact repeats [
^{13} ], maximal unique matching subsequences [
^{15} ], and perfect repeats [
^{19} ].
The normalized compression distance is part of a plethora of information-theoretic measures of the degree of similarity between two sequences based on their relative compressibility. The idea is that similar sequences are expected to contain a larger number of common subsequences, and therefore should be more compressible than distant sequences. The normalized compression distance is here estimated using multiple finite-context models. It exposes similarities between the sequences, as well as, the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. An important advantage of these information-based similarity measures is that they do not require a feature extraction step.
The genomic distance here proposed is not an evolutionary distance. As an alignment-free distance, it is not based on homology, hence, it is not easily convertible to an evolutionary distance [
^{20} ] (though recent efforts have attempted a link between both [
^{21} ]). The genomic distance presented is proposed for comparing genome assemblies and, as such, will tendentiously be smaller between genome assemblies of equivalent rather than diverging quality, an effect that may even superimpose a larger evolutionary distance.
2.1 Preliminary Definitions
Let
be the
th character of string
, with
. A substring of
starting at position
and ending at position
is denoted by
, with
. If
, then
. Moreover,
(
) denotes the concatenation of character
(
) to the left- (right-) end side of
, with
. For convenience, consider also two additional characters,
and
, that do not belong to the alphabet
. By definition, the character to the left of the first character of string
is
, i.e.,
, while the character to the right of the last character of string
is
, i.e.,
(this definition of
and
incurs in a few formal, though innocuous, inconsistencies, ignored here for the sake of simplicity).
A maximal repeated pair in
is a pair of identical substrings (
) such that the character to the immediate left (right) of one of the substrings is different from the character to the immediate left (right) of the other substring (
and
). It is represented by a triple (
), where
and
are the starting positions of the two substrings, with
. A substring
is a
maximal exact match of
if there is at least a maximal repeated pair in
of the form (
) [
^{13} ]. Hence, by definition, each maximal exact match occurs at least twice, and a long maximal exact match may contain smaller maximal exact matches within itself. These definitions have been previously put forward elsewhere [
^{22} ], [
^{23} ], [
^{24} ] but are here included again for the sake of self-containment.
2.2 Genomic Sequences
We use four human genome assemblies as proof-of-concept. The human reference genome assembly GRCh37 build 37.1, available at the National Center for Biotechnology Information (NCBI) website, is used as the gold standard for the intraspecies comparisons detailed. This is an upgrade on the initial human reference genome sequenced with hierarchical shotgun capillary-based methodologies [
^{25} ], [
^{26} ], which has been carefully revised and corrected throughout the last decade (e.g., [
^{27} ], [
^{28} ]), thus significantly reducing the number of gaps from an initial value of
to
[
^{29} ]. We also use the May 2007 HuRef assembly of the genome of J. Craig Venter, sequenced with capillary-based whole-genome shotgun technologies using the Applied Biosystems 3730xl DNA analyzer and
de novo assembled with the Celera Assembler [
^{30} ], which is also available at the NCBI website. We use the NA12878 assembly of DNA from the cell line GM12878, sequenced with massively parallel sequencing technologies using Illumina Genome Analyzers and assembled with the ALLPATHS-LG program [
^{31} ], whose unplaced scaffolds are available at the GenBank website [
^{32} ]. Finally, we use the YH assembly of the genome of a Han Chinese, also sequenced with massively parallel sequencing technologies using Illumina Genome Analyzers and assembled with the SOAPdenovo assembler [
^{33} ], whose unplaced scaffolds are available at the BGI-Shenzhen website [
^{34} ]. All chromosomes (GRCh37 and HuRef assemblies) or scaffolds (NA12878 and YH assemblies) are initially concatenated using a delimiting character that does not belong to the alphabet
, to avoid artificial matches across boundaries. Moreover, we ignore all sequence ambiguities by replacing any subsequence of nucleotides not in
with the same delimiting character. Because this delimiting character is processed as a string terminator, the order by which the chromosomes or scaffolds are inserted is irrelevant and does not affect the results.
2.3 A Genomic Distance for Assembly Comparison Based on Compressed Maximal Exact Matches
To compute the genomic distance based on compressed maximal exact matches, we start by discovering all maximal exact matches in each genome assembly, using the algorithm detailed in Appendix A (see the supplemental material, which can be found on the Computer Society Digital Library at http://doi. ieeecomputersociety.org/10.1109/TCBB.2013.77). We discard information about the copy number, or multiplicity, of the maximal exact matches discovered, and refer to this set of single-copy words as all unique maximal exact matches.
An overall analysis of the distributions of all unique maximal exact matches is likely to mask important details, as the distributions are highly nonstationary. Hence, we divide the distributions in four ranges of maximal exact match lengths and define sets containing all maximal exact matches of length within the range. Set
S1 contains all maximal exact matches of length 1 bp
bp, set
S10 contains all maximal exact matches of length 10 bp
bp, set
S100 contains all maximal exact matches of length 100 bp
bp, and set
S1,000 contains all maximal exact matches of length 1,000 bp
bp, where unit bp stands for base pairs. For simplicity of notation, the
S in the set name will be, whenever appropriate, replaced by the first letter of the genome assembly it refers to. For example, set
G1 contains all maximal exact matches of length 1 to 9 bp in the GRCh37 assembly.
We compress these sets of maximal exact matches using the methodology of finite-context models described in Appendix D, available in the online supplemental material. We then assess their similarity by computing the normalized compression distance (NCD) in (3) of Appendix C, available in the online supplemental material, between each pair of sets of maximal exact matches.
Finally, we define the genomic distance based on compressed maximal exact matches between two genomic sequences as
(1)
where parameter
is the number of ranges of maximal exact match lengths, with a value of four for all genome assemblies in this study. It is possible to compare sequences that are divided into different number of sets, by correcting the denominator to the product of both numbers, instead of
. The genomic distance based on compressed maximal exact matches is, therefore, an average normalized compression distance between sequences
and
estimated over all sets of maximal exact matches in each sequence. However, unlike the normalized compression distance,
is symmetric by construction, i.e.,
.
An implementation of this genomic distance is freely available for noncommercial use in a software package called "Quality of Assemblies by Repeat Compression" at http:// bioinformatics.ua.pt/software/quarc/.
3. Results and Discussion
Fig. 1 displays the distribution of all unique maximal exact matches as a function of the maximal exact match length in units of base pairs (bp) for each human genome assembly. For small lengths, the number of maximal exact matches increases exponentially with
, where 4 is the alphabet cardinality and
is the length of the maximal exact match, as all
possible words occur and are maximal exact matches. Close to and beyond the inflexion point (at length 16 bp), the number of maximal exact matches becomes smaller than the number of
possible words. The rate of decrease in the number of maximal exact matches beyond this inflexion points is initially exponential (until maximal exact repeat lengths of
bp), followed by a subexponential rate in the right heavy-tails. Contrasting these distributions to those of random strings highlights the absence of heavy-tails in the latter, for all sequence lengths and alphabet cardinalities tested (Fig. 3 in Appendix B, available in the online supplemental material). In particular, random strings of length
characters in a four-letter alphabet, which mimic the genome size and alphabet cardinality of the human genome, have maximal exact matches with a maximum length of
bp. Hence, the presence of right heavy-tails is an exclusive property of genomic sequences.
Fig. 1. Distribution of the number of unique maximal exact matches as a function of the maximal exact match length in units of base pairs (bp) in four human genome assemblies.
Table 1 summarizes the statistics of the distributions of maximal exact matches in each human genome assembly.
Size refers to the size of the accessible genome, which is the number of unambiguous nucleotides (A,C,G or T) in units of base pairs (bp). The total
number of unique maximal exact matches does not consider their copy number, or multiplicity. The columns
minLen,
medLen, and
maxLen display, respectively, the minimum, median, and maximum length of the maximal exact matches in the genome assembly, in units of base pairs. The columns
minCN and
maxCN are the minimum and maximum copy number, respectively, in the genome assembly. For completeness, we looked into the variation in copy number as a function of the maximal exact match length and observed similar results for all four human genome assemblies. For very small and very large maximal exact matches, the copy number variance is small, with the former having consistently high copy number and the latter having very small copy numbers. In the middle of the distribution of maximal exact match lengths, the copy number variance is significant, implying that for a given length, some maximal exact matches will have small copy number, where others will have very large copy numbers. Results for random strings are similar in the ranges of small and large maximal exact match lengths, but the copy number variance of maximal exact matches of median length is much smaller in random strings than in genomic sequences.
Table 1. Summary Statistics of the Distributions of Maximal Exact Matches in Four Human Genome Assemblies
Size refers to the accessible genome. Number is the total number of unique maximal exact matches in the genome assembly. minLen, medLen, and maxLen are, respectively, the minimum, median, and maximum length of the maximal exact matches in units of base pairs. minCN and maxCN are the minimum and maximum copy number, respectively.
Because overall statistics may mask important details, particularly in nonstationary distributions, we investigated different sets of maximal exact matches separately (Section 2.3).
Table 2 displays the number of unique maximal exact matches per set in each human genome assembly. The number of maximal exact matches in the smaller range of lengths (sets
S1) is equal in all human genome assemblies. This similarity is increasingly lost in ranges of increasingly longer maximal exact matches.
Table 2. Number of Unique Maximal Exact Matches per Set in Four Human Genome Assemblies
The data in
Fig. 1 and
Tables 1 and
2 highlight well-known difficulties in the
de novo assembly of large repetitive regions. The analysis of the total number of maximal exact matches, their length, and copy number, using the reference GRCh37 assembly as gold standard, shows a decreasing tendency as one moves from the capillary-based HuRef assembly, to the massively parallel NA12878 and YH assemblies. The explanation for this decrease lies with both the sequencing technology and the assembly algorithm. GRCh37 and HuRef were both sequenced with capillary-based methodologies, though using different approaches, but the former has been continuously revised and corrected, rendering it the most complete and accurate human genome sequence so far (HuRef covers
percent of GRCh37). NA12878 and YH were both sequenced with massively parallel methodologies, but the former presents a larger and richer content in maximal exact matches, which is in accordance with the claim by the authors that the quality of this NA12878 assembly (which only covers
percent of GRCh37) approaches that of capillary-based sequencing assemblies [
^{31} ]. A key factor for explaining the larger number of median and large maximal exact matches in the NA12878 assembly is that it benefited from a large fosmid library of variable-length inserts, including 40 kilobases (kb). Clearly, the assembly methodology also influences the final genome assembly, and ALLPATHS-LG fared better than SOAPdenovo in the Assemblathon competition [
^{1} ].
We compressed the sets of unique maximal exact matches using eight finite-context models with context depths
(Appendix D, available in the online supplemental material). Sets of maximal exact matches were sorted lexicographically before compression. Probabilities were estimated with
in (4) in the case of larger contexts (14 and 16) and the default value of
otherwise [
^{35} ]. Inverted repeats were considered by selecting the respective flag. Our experience in using multiple finite-context models for compressing long genomic sequences of repeat-rich genomes such as the human genome, points at this being a good combination of parameters for ensuring a good final compression.
Fig. 2 displays the normalized compression distance (3) between sets of maximal exact matches as colormaps. Smaller normalized compression distance values imply greater sequence similarity, whereas larger normalized compression distance values imply greater sequence dissimilarity or difference. As there is some dependence on the compressor used, the matrices displayed in
Fig. 2 are not symmetrical (e.g.,
). Moreover, the values of the normalized compression distance are slightly larger than zero for comparisons of the same set (e.g.,
) due to some overheads of the coding scheme. Likewise, the values of the normalized compression distance may be slightly larger than one for very dissimilar sets.
Fig. 2. Normalized compression distance between sets of maximal exact matches in four human genome assemblies.
The data in
Fig. 2 highlight the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. For example, compare the shade of red in the squares corresponding to the comparison of sets
G1 and
G1000. Clearly, knowing the small maximal exact matches previously (light red square in the first row of the fourth column) improves the coding of the larger maximal exact matches (the converse scenario is represented by a dark red square in the fourth row of the first column).
Fig. 2 also exposes similarities in maximal exact matches between genome assemblies, as exemplified by the light blue squares corresponding to the comparison of sets
G1 and
H1 (first row of the fifth column), sets
G1 and
N1 (first row of the ninth column), and sets
G1 and
Y1 (first row of the 13th column). If the normalized compression distance had not been assessed by length range but globally, all detail would have been lost.
The normalized compression distances ((3) and
Fig. 2 ) were used to estimate the genomic distance for assembly comparison based on compressed maximal exact matches in (1), whose values are displayed in
Table 3 . The HuRef assembly has the second largest content in maximal exact matches (
Table 1 ) and the greatest sequence similarity to the reference GRCh37 assembly, followed by the NA12878 and YH assemblies, respectively. This ranking makes evident the desirable direct dependency of the genomic distance here proposed on the overall content in maximal exact matches of the genome assemblies being compared. Two key factors will impart this overall content. The first is organismal and genome evolution, as two individuals from the same species will share a larger number and content of genomic features than those from different species. The second factor is current limitations of sequencing technologies and assembly algorithms that hinder the uncovering of repetitive regions in the draft assemblies.
Table 3. The Genomic Distance Based on Compressed Maximal Exact Matches between Four Human Genome Assemblies
The genomic distance here proposed for assembly comparison may be used with data other than whole-genome assemblies. We use data made available by a recent critical assessment of genome assemblies and assembly algorithms as an example [
^{2} ]. The data consist of 10 different assemblies of the human chromosome 14 and we consider three ranges of maximal exact match lengths, namely, sets
S1,
S10, and
S100. As before, we do not consider length ranges for which some sets of maximal exact matches would be empty.
Table 4 displays the genomic distance for assembly comparison based on compressed maximal exact matches (1) resulting from the comparison of the reference GRCh37 assembly of this chromosome with other nine assemblies. In accordance with the findings reported in the aforementioned study, for which "ALLPATHS-LG and CABOG stood out as assemblers capable of producing both high contiguity and high accuracy" assemblies [
^{2} ], the genomic distance here proposed outputs CABOG and ALLPATHS-LG as the chromosome assemblies more similar to the reference GRCh37 assembly.
Table 4. The Genomic Distance Based on Compressed Maximal Exact Matches for Human Chromosome 14 between the Reference GRCh37 Assembly and Other Nine Assemblies
We have proposed a genomic distance for assembly comparison based on the normalized compression distance of all unique maximal exact matches in two genome sequences. The normalized compression distance measures the improvement in the compression of a given sequence by previously knowing another sequence through the concatenation of both strings. As discussed in Appendix C, available in the online supplemental material, the normalized compression distance and, by consequence, the genomic distance here proposed, depend on the compressor used. However, consistency in the use of an optimized compressor throughout all comparisons will minimize this potential caveat. The same rationale should be applied to the choice of compression models (Appendix D, available in the online supplemental material). Moreover, string operations other than the concatenation of both full-length strings, such as a random mixture of subsequences from both sequences, are also interesting to consider and will be explored in future work.
The genomic distance proposed is aimed at the comparison of genome assemblies. By providing a quantification of missing repeats in a genome assembly, as well as their identification, this measure is suited as a quality metric of both coverage and accuracy. Repeat analysis in the first Assemblathon competition [
^{1} ] was based on simulated reads. Though this is the appropriate setting for comparing assemblies, as the "correct" answer is known, it has the limitation of conveying no information on how the various assembly methods and algorithms fare with real, large, and repeat-rich data sets. Therefore, we suggest adding this genomic distance to the benchmark of commonly used assembly quality metrics [
^{1} ], to complement other quality metrics based on the analysis of specific genomic regions (e.g., [
^{42} ]). Studies that comprehensively assess genome assemblies often report the dispersion of quality metrics and lack of consistency of the evaluations [
^{1} ], [
^{2} ], [
^{3} ]. The rankings of assemblies here obtained are concordant with overall assembly evaluations; hence, this quality metric does not contradict other measures of contiguity.
Finally, though the rationale for the genomic distance here proposed depends on intraspecies data and the existence of a "gold standard" to which assemblies are compared, a high-quality assembly of an evolutionary-close organism may be used in the absence of such standard.
Acknowledgments
This work was supported by project grants FCOMP-01-0124-FEDER-010095, FCOMP-01-0124-FEDER-022682, and Incentivo/EEI/UI0127/2013 of the Portuguese Foundation for Science and Technology (Fundação para a Ci
ncia e a Tecnologia), by the Operational Programme Thematic Factors of Competitiveness (Programa Operacional Factores de Competitividade, COMPETE), by the National Strategic Reference Framework (Quadro de Refer
ncia Estratégico Nacional, QREN), and by the European Regional Development Fund (Fundo Europeu de Desenvolvimento Regional, FEDER). SS was funded by a grant from Project FCOMP-01-0124-FEDER-010095. SPG acknowledges funding from the European Social Fund and the Portuguese Ministry of Education and Science.
S.P. Garcia is with the Signal Processing Laboratory, Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal. E-mail: spgarcia@ua.pt.
J.M.O.S. Rodrigues, C.A.C. Bastos, P.J.S.G. Ferreira, and A.J. Pinho are with the Signal Processing Laboratory, Institute of Electronics and Telematics Engineering of Aveiro, and the Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
S. Santos and D. Pratas are with the Signal Processing Laboratory, Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
V. Afreixo is with the Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal, and the Center for Research & Development in Mathematics and Applications, University of Aveiro, 3810-193 Aveiro, Portugal.
Manuscript received 26 Feb. 2013; revised 6 June 2013; accepted 18 June 2013; published online 1 July 2013.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2013-02-0062.
Digital Object Identifier no. 10.1109/TCBB.2013.77.