Issue No. 02 - February (1979 vol. 1)
Ching Y. Suen, SENIOR MEMBER, IEEE, Department of Computer Science, Concordia University, Montreal, P.Q., Canada; Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139.
n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
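As a rough illustration of the kind of statistics the abstract describes, the following sketch counts n-grams for n = 1 to 5 together with their positions. It assumes that the n-grams are letter sequences taken inside individual words and that "position" means the 0-based offset of the n-gram within its word; the abstract does not spell out either point. The function names and the sample sentence are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count letter n-grams inside each word (assumption: no n-gram spans a word boundary)."""
    counts = Counter()
    for word in words:
        w = word.lower()
        # Words shorter than n contribute nothing because the range is empty.
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts


def positional_counts(words, n):
    """Count (n-gram, start offset) pairs; offsets are 0-based within the word."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            counts[(w[i:i + n], i)] += 1
    return counts


def word_length_counts(words):
    """Count how many words occur at each length."""
    return Counter(len(word) for word in words)


if __name__ == "__main__":
    # Toy input, standing in for a real corpus.
    sample = "the cat sat on the mat".split()
    for n in range(1, 6):
        print(n, ngram_counts(sample, n).most_common(3))
    print(word_length_counts(sample))
```

On a real corpus the word list would come from a tokenizer, and the raw counts would normally be divided by their total to give relative frequencies before being compared across vocabularies.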
Statistics, Natural languages, Text processing, Humans, Statistical distributions, Application software, Vocabulary, Error correction, Frequency, Optical character recognition software
C. Y. Suen, "n-Gram Statistics for Natural Language Understanding and Text Processing," in IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 1, no. 2, pp. 164-172, Feb. 1979.