Antisequential Suffix Sorting for BWT-Based Data Compression
April 2005 (vol. 54 no. 4)
pp. 385-397
Suffix sorting requires ordering all suffixes of all symbols in an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-$N$ input over a size-$|\mathcal{X}|$ alphabet, the worst-case complexities of these algorithms are $\Theta(N^2)$, $O(|\mathcal{X}| N \log(N/|\mathcal{X}|))$, and $O(N\sqrt{|\mathcal{X}| \log(N/|\mathcal{X}|)})$, respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that these algorithms are simple; some of them can be implemented in VLSI. This could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
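
To make the relationship between suffix sorting and the BWT concrete, the following is a minimal sketch and not the suffix lists algorithm proposed in the article. It uses a naive comparison sort of all suffixes, which is exactly the kind of approach that has poor worst-case behavior on highly repetitive inputs. The function names `suffix_array` and `bwt_from_suffix_array`, and the use of `$` as a unique end-of-input sentinel that sorts before every other symbol, are assumptions made for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start indices of all suffixes of `text` in lexicographic order.

    Naive approach: each comparison may scan many symbols, so this is slow
    on long repetitive inputs. It is shown only to define the problem.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """Build the BWT output from a suffix array.

    Assumes `text` ends with a unique sentinel that is smaller than every
    other symbol. The BWT symbol for each sorted suffix is the symbol that
    precedes it in `text` (index -1 wraps around to the sentinel).
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    text = "banana$"  # "$" is the assumed sentinel
    sa = suffix_array(text)
    print(sa)
    print(bwt_from_suffix_array(text, sa))
```

Once the suffixes are sorted, the transform itself is a single linear pass, which is why the cost of BWT-based compression is dominated by the suffix sorting step.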

Index Terms:
Burrows Wheeler transform, data compression, source coding, suffix sorting, VLSI.
Citation:
Dror Baron, Yoram Bresler, "Antisequential Suffix Sorting for BWT-Based Data Compression," IEEE Transactions on Computers, vol. 54, no. 4, pp. 385-397, April 2005, doi:10.1109/TC.2005.56