19th IEEE International Conference on Tools with Artificial Intelligence - Vol.2 (ICTAI 2007)
Document Length Normalization by Statistical Regression
Paris, France
October 29-October 31
ISBN: 0-7695-3015-X
The document-length normalization problem has been widely studied in the field of Information Retrieval. Cosine Normalization [2], Maximum tf Normalization [1] and Byte Length Normalization [12] are the most commonly used normalization techniques. In [14], the authors studied the probability of retrieving documents with respect to their size under different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show that document and query sizes strongly influence the expected similarity score. We therefore propose a statistical regression of the similarity score distribution with respect to document and query sizes in order to normalize the scores. Experimental results suggest that our approach judges similarities considerably more fairly, both in classical Information Retrieval and when applied to a document clustering process.
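The idea of regressing similarity scores on document and query sizes can be illustrated with a minimal sketch. The abstract does not state the form of the regression or how the normalization is applied, so everything specific here is an assumption made for illustration: the model is linear in the logarithms of the two sizes, it is fitted by ordinary least squares, and a score is normalized by subtracting its size-based expectation. The function names (`fit_expected_score`, `expected_score`, `normalize`) are hypothetical and do not come from the paper.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: sizes do not vary enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_expected_score(samples):
    """Fit score ~ a + b*log(doc_len) + c*log(query_len) by least squares.

    samples: iterable of (doc_len, query_len, score) triples.
    The log-linear form is an assumption, not the paper's stated model.
    """
    samples = list(samples)
    rows = [(1.0, math.log(d), math.log(q)) for d, q, _ in samples]
    ys = [s for _, _, s in samples]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve_linear(xtx, xty))


def expected_score(coeffs, doc_len, query_len):
    """Score predicted from the document and query sizes alone."""
    a, b, c = coeffs
    return a + b * math.log(doc_len) + c * math.log(query_len)


def normalize(score, coeffs, doc_len, query_len):
    """Residual of a raw score against its size-based expectation (assumed scheme)."""
    return score - expected_score(coeffs, doc_len, query_len)
```

Under these assumptions, `fit_expected_score` would be run once on a sample of (document size, query size, raw score) triples, and `normalize` would then be applied to each raw score before ranking or clustering, so that a document is not favored simply because of its length.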
Citation:
Sylvain Lamprier, Tassadit Amghar, Bernard Levrat, Frédéric Saubion, "Document Length Normalization by Statistical Regression," ictai, vol. 2, pp.11-18, 19th IEEE International Conference on Tools with Artificial Intelligence - Vol.2 (ICTAI 2007), 2007