Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing
December 2002 (vol. 24 no. 12)
pp. 1661-1666

Abstract—It has been shown that simple substitution ciphers can be solved using statistical methods such as probabilistic relaxation. However, the utility of such solutions has been limited by their inability to cope with noise encountered in practical applications. In this paper, we propose a new solution to substitution deciphering based on hidden Markov models. We show that our algorithm is more accurate than relaxation and much more robust in the presence of noise, making it useful for applications in compressed document processing. Recovering character interpretations from the sequence of cluster identifiers in a symbolically compressed document can be treated as a cipher problem. Although a significant amount of noise is present in the cluster sequence, enough information can be recovered with a robust deciphering algorithm to accomplish certain document analysis tasks. The feasibility of this approach is demonstrated in a multilingual document duplicate detection system.
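
As an illustration of how a substitution cipher can be cast as a hidden Markov model, the sketch below treats plaintext characters as hidden states and cipher symbols (or cluster identifiers) as observations. It rests on assumptions that the abstract does not spell out: the transition probabilities are fixed from a smoothed character bigram model of a reference corpus, and only the emission probabilities are re-estimated by EM (Baum-Welch). The function names `bigram_model`, `expected_counts`, and `decipher` are hypothetical, and this is not presented as the paper's implementation.

```python
import math
import random


def bigram_model(text, alphabet):
    """Start and transition probabilities from a corpus, add-one smoothed."""
    idx = {c: i for i, c in enumerate(alphabet)}
    n = len(alphabet)
    seq = [idx[c] for c in text if c in idx]
    start = [1.0] * n
    trans = [[1.0] * n for _ in range(n)]
    for a in seq:
        start[a] += 1.0
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1.0
    total = sum(start)
    start = [x / total for x in start]
    trans = [[x / sum(row) for x in row] for row in trans]
    return start, trans


def expected_counts(obs, start, trans, emit):
    """Scaled forward-backward pass (E-step).

    Returns (counts, log_likelihood), where counts[i][k] is the expected
    number of times plaintext state i produced cipher symbol k.
    """
    n, m = len(start), len(emit[0])
    alphas, scales = [], []
    for t, o in enumerate(obs):
        if t == 0:
            a = [start[i] * emit[i][o] for i in range(n)]
        else:
            p = alphas[-1]
            a = [emit[j][o] * sum(p[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
        c = sum(a)
        alphas.append([x / c for x in a])
        scales.append(c)
    counts = [[0.0] * m for _ in range(n)]
    beta = [1.0] * n
    for t in range(len(obs) - 1, -1, -1):
        for i in range(n):
            counts[i][obs[t]] += alphas[t][i] * beta[i]  # posterior of state i at t
        if t > 0:
            w = [emit[j][obs[t]] * beta[j] / scales[t] for j in range(n)]
            beta = [sum(trans[i][j] * w[j] for j in range(n)) for i in range(n)]
    return counts, sum(math.log(c) for c in scales)


def normalize(row, fallback):
    """Scale a row to sum to 1; keep the fallback row if the sum is zero."""
    total = sum(row)
    return [x / total for x in row] if total > 0 else fallback


def decipher(cipher, corpus, alphabet, iters=30, seed=0):
    """Estimate a cipher-symbol -> plaintext-character key by EM.

    Assumes a non-empty cipher sequence and iters >= 1.
    """
    symbols = sorted(set(cipher))
    sym_idx = {s: k for k, s in enumerate(symbols)}
    obs = [sym_idx[s] for s in cipher]
    start, trans = bigram_model(corpus, alphabet)
    rng = random.Random(seed)
    n, m = len(alphabet), len(symbols)
    # Random (non-uniform) initial emissions so EM can break symmetry.
    emit = [normalize([1.0 + rng.random() for _ in range(m)], None)
            for _ in range(n)]
    history = []
    for _ in range(iters):
        counts, log_lik = expected_counts(obs, start, trans, emit)
        history.append(log_lik)
        # M-step: update emissions only; transitions stay fixed.
        emit = [normalize(row, old) for row, old in zip(counts, emit)]
    # Assign each cipher symbol the plaintext state most often responsible for it.
    key = {s: alphabet[max(range(n), key=lambda i: counts[i][sym_idx[s]])]
           for s in symbols}
    return key, emit, history
```

Calling `decipher(cipher_symbols, reference_text, alphabet)` returns a candidate key, the learned emission table, and the log-likelihood after each iteration. Because the emission table is soft rather than a strict one-to-one mapping, a noisy sequence in which one character is split across several clusters, or several characters share a cluster, can still be modeled. EM only finds a local optimum, so short or very noisy inputs may yield a partly wrong key.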

Index Terms:
Substitution ciphers, HMM, symbolic compression.
Dar-Shyang Lee, "Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1661-1666, Dec. 2002, doi:10.1109/TPAMI.2002.1114860