Pattern Matching in LZW Compressed Files
August 2005 (vol. 54 no. 8)
pp. 929-938
Tao Tao and Amar Mukherjee
Compressed pattern matching is an emerging research area that addresses the following problem: given a text file in compressed format and a pattern, report the occurrence(s) of the pattern in the file with minimal (or no) decompression. In this paper, we report our work on compressed pattern matching in LZW compressed files. The work includes an extension of Amir et al.'s well-known “almost-optimal” algorithm: the original algorithm has been improved to find not only the first occurrence of the pattern but also all other occurrences. A faster implementation for so-called “simple patterns” is also proposed. The work also includes a novel multiple-pattern matching algorithm based on the Aho-Corasick algorithm. This algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total length of all patterns, t is the size of the LZW trie, and r is the number of occurrences of the patterns. Extensive experiments have been conducted to test the performance of our algorithms and to compare them with other well-known compressed pattern matching algorithms, particularly the BWT-based algorithms and a similar multiple-pattern matching algorithm by Kida et al. that also applies the Aho-Corasick algorithm to LZW compressed data. The results show that our multiple-pattern matching algorithm is competitive with the best compressed pattern matching algorithms and is, in practice, the fastest among all approaches when the number of patterns is not very large. Our algorithm is therefore preferable for general string matching applications. The proposed algorithm is efficient for large files, and it is particularly efficient when applied to archive search if the archives are compressed with a common LZW trie. LZW is one of the most efficient, popular, and widely used compression algorithms, and our method requires no modification of the compression algorithm. The work reported in this paper therefore has great economic and market potential.
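
To make the O(mt) space term concrete, the following is a minimal sketch, not the authors' algorithm, of one way to run an Aho-Corasick automaton directly over LZW codes. It assumes a textbook byte-oriented LZW with unbounded code width, rebuilds the trie while reading the codes, and stores for every trie node one row with an entry per automaton state. It only counts occurrences rather than reporting their positions, and all function names (`lzw_compress`, `build_aho_corasick`, `count_in_lzw`) are hypothetical.

```python
from collections import deque


def lzw_compress(data: bytes) -> list[int]:
    """Textbook LZW over bytes; codes 0..255 are the single-byte strings."""
    table = {bytes([b]): b for b in range(256)}
    codes, cur = [], b""
    for b in data:
        nxt = cur + bytes([b])
        if nxt in table:
            cur = nxt
        else:
            codes.append(table[cur])
            table[nxt] = len(table)  # new trie node: phrase + next byte
            cur = bytes([b])
    if cur:
        codes.append(table[cur])
    return codes


def build_aho_corasick(patterns: list[bytes]):
    """Return (goto, hits): a full transition table and, per state,
    the number of (non-empty) patterns that end when the state is entered."""
    children, fail, hits = [{}], [0], [0]
    for p in patterns:
        s = 0
        for b in p:
            if b not in children[s]:
                children.append({})
                fail.append(0)
                hits.append(0)
                children[s][b] = len(children) - 1
            s = children[s][b]
        hits[s] += 1
    goto = [[0] * 256 for _ in children]
    queue = deque()
    for b, s in children[0].items():
        goto[0][b] = s
        queue.append(s)
    while queue:  # breadth-first, so fail[s] is always finished before s
        s = queue.popleft()
        hits[s] += hits[fail[s]]
        for b in range(256):
            if b in children[s]:
                c = children[s][b]
                fail[c] = goto[fail[s]][b]
                goto[s][b] = c
                queue.append(c)
            else:
                goto[s][b] = goto[fail[s]][b]
    return goto, hits


def count_in_lzw(codes: list[int], patterns: list[bytes]) -> int:
    """Count pattern occurrences in the text encoded by `codes`,
    spending O(1) per code and O(#states) per new trie node."""
    goto, hits = build_aho_corasick(patterns)
    n_states = len(goto)
    # For trie node v and start state q:
    #   nxt[v][q]   = state reached after reading v's string from q
    #   found[v][q] = matches that end inside v's string when starting from q
    nxt = [[goto[q][b] for q in range(n_states)] for b in range(256)]
    found = [[hits[s] for s in row] for row in nxt]
    first = list(range(256))  # first byte of each node's string
    state, total, prev = 0, 0, None
    for code in codes:
        if prev is not None:
            # The decoder's new node is string(prev) + first byte of the
            # current phrase; if the code is not yet defined, that byte
            # is the first byte of string(prev) itself.
            b = first[code] if code < len(nxt) else first[prev]
            row = [goto[s][b] for s in nxt[prev]]
            nxt.append(row)
            found.append([f + hits[s] for f, s in zip(found[prev], row)])
            first.append(first[prev])
        total += found[code][state]
        state = nxt[code][state]
        prev = code
    return total
```

For example, `count_in_lzw(lzw_compress(b"aaaa"), [b"aa"])` returns 3, counting overlapping matches. Under these assumptions each new trie node costs one row of length equal to the number of automaton states (at most m+1), which is where a table of size O(mt) comes from, while each code is handled with two table lookups; reporting the r individual positions, as the paper's algorithm does, would need additional bookkeeping that this sketch omits.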

[1] D.A. Adjeroh, A. Mukherjee, M. Powell, T.C. Bell, and N. Zhang, “Pattern Matching in BWT-Compressed Text,” Proc. Data Compression Conf., p. 445, Apr. 2002.
[2] A. Amir, G. Benson, and M. Farach, “Let Sleeping Files Lie: Pattern Matching in Z-Compressed Files,” J. Computer and System Sciences, vol. 52, pp. 299-307, 1996.
[3] P. Barcaccia, A. Cresti, and S.D. Agostino, “Pattern Matching in Text Compressed with the ID Heuristic,” Proc. Data Compression Conf., pp. 113-118, 1998.
[4] T.C. Bell, D.A. Adjeroh, and A. Mukherjee, “Pattern Matching in Compressed Text and Images,” May 2001, http://www.csee. cpmPapersacmSurvey2001.pdf.
[5] T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh, “Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search,” Proc. Data Compression Conf., pp. 112-121, Apr. 2002.
[6] M. Farach and M. Thorup, “String Matching in Lempel-Ziv Compressed Strings,” Algorithmica, vol. 20, pp. 388-404, 1998.
[7] P. Ferragina and G. Manzini, “An Experimental Study of an Opportunistic Index,” Proc. 12th ACM-SIAM Symp. Discrete Algorithms (SODA 2001), pp. 269-278, 2001.
[8] A. Firth, “A Comparison of BWT Approaches to Compressed-Domain Pattern Matching,” Honors report, Univ. of Canterbury, reports/Hons Reps/2002hons_0205.pdf, accepted for publication.
[9] D.E. Knuth, J.H. Morris, and V.R. Pratt, “Fast Pattern Matching in Strings,” SIAM J. Computing, vol. 6, p. 323, 1977.
[10] G. Navarro and M. Raffinot, “A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text,” Proc. Combinatorial Pattern Matching, pp. 14-36, 1999.
[11] T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, “Multiple Pattern Matching in LZW Compressed Text,” Proc. Data Compression Conf., p. 103, Mar. 1998.
[12] T. Kida, M. Takeda, M. Miyazaki, A. Shinohara, and S. Arikawa, “Multiple Pattern Matching in LZW Compressed Text,” J. Discrete Algorithms, vol. 1, no. 1, 2000.
[13] A.V. Aho and M.J. Corasick, “Efficient String Matching,” Comm. ACM, vol. 18, no. 6, pp. 333-340, June 1975.
[14] N. Zhang, T. Tao, R.V. Satya, and A. Mukherjee, “A Modified LZW Algorithm for Compressed Text Retrieval with Random Access Property,” draft, Computer Science Dept., Univ. of Central Florida, 2004.

Index Terms:
Data compaction and compression, information search and retrieval, pattern matching.
Tao Tao, Amar Mukherjee, "Pattern Matching in LZW Compressed Files," IEEE Transactions on Computers, vol. 54, no. 8, pp. 929-938, Aug. 2005, doi:10.1109/TC.2005.133