
TF-IDF: should I normalize document length?

When using TF-IDF to compare Document A and Document B, I know that document length is not important. But when comparing A with B and A with C, I think documents B and C should be the same length.

For example:

Log: 100 words
Document A: 20 words
Document B: 30 words

TF-IDF score of Log vs. A: 0.xx
TF-IDF score of Log vs. B: 0.xx

Should I normalize the lengths of documents A and B? (If the comparison targets differ in length, it seems the results could be skewed or simply wrong.)

Generally, you want to do whatever gives you the best cross-validated results on your data.

If all you are doing to compare them is taking cosine similarity, then the vectors get normalized as part of the calculation, so varying document lengths won't by themselves affect the score. Many general document retrieval systems consider shorter documents to be more valuable, but this is typically handled as a score multiplier applied after the similarities have been calculated.
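To illustrate the point about cosine similarity, here is a minimal sketch in plain Python. The vectors and the names log_vec, doc_a and doc_a_longer are made up for illustration; they stand in for TF-IDF vectors over a shared vocabulary. Multiplying every weight in a document vector by a constant (roughly what happens when the same text is repeated several times) should leave the cosine score unchanged, because the vector norm in the denominator grows by the same factor.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors over the same 4-term vocabulary.
log_vec = [3.0, 1.0, 0.0, 2.0]
doc_a = [1.0, 0.0, 2.0, 1.0]
# Same term proportions as doc_a, but every weight is 5x larger,
# as if the document were five times as long.
doc_a_longer = [x * 5 for x in doc_a]

sim_short = cosine_similarity(log_vec, doc_a)
sim_long = cosine_similarity(log_vec, doc_a_longer)
print(sim_short, sim_long)  # the two values agree up to floating-point error
```

Note that this only covers pure scaling. A longer document that also contains different terms in different proportions will of course get a different score, but that is a difference in content, not in length.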

Oftentimes ln(TF) is used instead of raw TF as a form of normalization, because the difference between seeing a term 1 and 2 times is far more important than the difference between seeing it 100 and 200 times; it also keeps excessive use of a term from dominating the vector and is typically much more robust.
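A small sketch of the log scaling, assuming the common 1 + ln(TF) variant (the extra 1 is my assumption, not part of the answer above; it keeps a term that occurs once from getting weight 0). The helper name log_tf is hypothetical. With raw counts the 100 vs. 200 gap is a hundred times larger than the 1 vs. 2 gap, while after log scaling both gaps come out to about ln 2.

```python
import math

def log_tf(count):
    """Sublinear term frequency: 1 + ln(count), or 0 for an absent term."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# Raw counts: the 100 -> 200 gap dwarfs the 1 -> 2 gap.
raw_gap_small = 2 - 1
raw_gap_large = 200 - 100

# Log-scaled counts: doubling a count adds the same amount (ln 2)
# whether the term was seen once or a hundred times.
gap_small = log_tf(2) - log_tf(1)
gap_large = log_tf(200) - log_tf(100)

print(raw_gap_small, raw_gap_large)
print(gap_small, gap_large)
```

If you happen to be using scikit-learn, TfidfVectorizer has a sublinear_tf option that applies this same 1 + log(tf) scaling.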

