Fourth International Conference Document Analysis and Recognition (ICDAR'97)
Language identification of on-line documents using word shapes
Ulm, Germany, August 18-20, 1997
ISBN: 0-8186-7898-4
The authors have extended existing methods to identify the language of an on-line document after the characters have been coded using 10 character classes based on visual characteristics. In particular, they exploit word bigrams and trigrams in both a linear combination of score values and an expert system approach. Knowledge about each language was acquired from a large number of on-line texts. Using a small set of rules, the expert system outperforms the linear combination in accuracy and shows more stability when parameter settings are varied.
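The general idea of shape coding followed by a linear combination of word bigram and trigram scores can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the 10 character classes, so the handful of classes below, the function names (`shape_code`, `train`, `score`, `identify`), the hit-rate scoring and the weights `w2` and `w3` are all hypothetical choices made for illustration. The expert system variant is not sketched, since its rules are not given here.

```python
from collections import Counter

# Hypothetical visual classes; the paper's actual 10 classes are not given
# in the abstract, so this coarse grouping is an assumption.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")

def shape_code(word):
    """Replace each character by a symbol for its coarse visual class."""
    out = []
    for ch in word:
        if ch.isdigit():
            out.append("0")          # digits
        elif ch.isupper() or ch in ASCENDERS:
            out.append("A")          # tall characters
        elif ch in DESCENDERS:
            out.append("g")          # characters with a descender
        elif ch in "ij":
            out.append("i")          # dotted characters
        elif ch.isalpha():
            out.append("x")          # x-height characters
        else:
            out.append("-")          # punctuation and anything else
    return "".join(out)

def ngrams(tokens, n):
    """All runs of n consecutive word-shape tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train(text):
    """Collect word-shape bigram and trigram counts from a training text."""
    tokens = [shape_code(w) for w in text.split()]
    return {n: Counter(ngrams(tokens, n)) for n in (2, 3)}

def score(model, text, w2=0.5, w3=0.5):
    """Linear combination of bigram and trigram hit rates (assumed scoring)."""
    tokens = [shape_code(w) for w in text.split()]
    total = 0.0
    for n, weight in ((2, w2), (3, w3)):
        grams = ngrams(tokens, n)
        if grams:
            hits = sum(1 for g in grams if g in model[n])
            total += weight * hits / len(grams)
    return total

def identify(models, text, **weights):
    """Return the language whose model gives the highest combined score."""
    return max(models, key=lambda lang: score(models[lang], text, **weights))
```

With one toy model per language, for example `models = {"english": train(english_text), "french": train(french_text)}`, a call such as `identify(models, document)` returns the best-scoring language. Varying `w2` and `w3` is the kind of parameter change whose effect on stability the abstract refers to.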
Index Terms:
identification; language identification; on-line documents; word shapes; coded characters; character classes; visual characteristics; word bigrams; word trigrams; linear score value combination; expert system; knowledge acquisition; on-line texts; rules; accuracy; stability; varied parameter settings
Citation:
N. Nobile, S. Bergler, C.Y. Suen, S. Khoury, "Language identification of on-line documents using word shapes," icdar, pp.258, Fourth International Conference Document Analysis and Recognition (ICDAR'97), 1997