n-Gram Statistics for Natural Language Understanding and Text Processing
February 1979 (vol. 1 no. 2)
pp. 164-172
n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the 1000 most frequent words of three other corpora. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and on trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
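As a rough illustration of the quantities the abstract refers to, the following minimal Python sketch counts letter n-grams within words and tallies them by position. It is not the paper's procedure. The function names, the tiny word list, the restriction to n-grams inside a single word, and the choice to key positions by word length and start index are all assumptions made for this example.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return the contiguous letter n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_counts(words, n):
    """Count letter n-grams over a list of words (no n-gram spans two words)."""
    counts = Counter()
    for word in words:
        counts.update(letter_ngrams(word.lower(), n))
    return counts


def positional_counts(words, n):
    """Count n-grams keyed by (word length, start position, n-gram)."""
    counts = Counter()
    for word in words:
        w = word.lower()
        for pos, gram in enumerate(letter_ngrams(w, n)):
            counts[(len(w), pos, gram)] += 1
    return counts


# Hypothetical toy input; a real study would read a full corpus instead.
sample = ["the", "then", "there", "other"]
bigrams = ngram_counts(sample, 2)
positions = positional_counts(sample, 2)
```

Calling `ngram_counts(words, n)` for n from 1 to 5 would give the overall frequency tables, while `positional_counts` separates the same counts by where in a word of a given length each n-gram starts.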
Index Terms:
Statistics, Natural languages, Text processing, Humans, Statistical distributions, Application software, Vocabulary, Error correction, Frequency, Optical character recognition software, word length analysis, Character recognition, context, language understanding, n-gram statistics, positional distributions of letters
"n-Gram Statistics for Natural Language Understanding and Text Processing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 164-172, Feb. 1979, doi:10.1109/TPAMI.1979.4766902