
TF-IDF: should I normalize for document length?

Original post 2017-06-17 02:15:44 4 1 python/ normalization/ word/ tf-idf

When using TF-IDF to compare documents A and B, I know that document length is not important. But when comparing the pairs A-B and A-C, I think documents B and C should be the same length.

For example:

Log: 100 words
Document A: 20 words
Document B: 30 words

Log - A TF-IDF score: 0.xx
Log - B TF-IDF score: 0.xx

Should I normalize documents A and B for length? (If the documents being compared against the log differ in length, the comparison seems problematic or may give a wrong result.)

1 answer

Generally you want to do whatever gives you the best cross validated results on your data.

If all you are doing to compare them is taking cosine similarity, then normalizing the vectors is already part of the calculation, so differences in document length alone won't affect the score. Many general document retrieval systems consider shorter documents to be more valuable, but this is typically handled as a score multiplier applied after the similarities have been calculated.
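To illustrate the point about cosine similarity, here is a minimal sketch in plain Python. The sample texts and the helper names `tf_vector` and `cosine` are made up for illustration, and raw term counts stand in for full TF-IDF weights (multiplying each term by an IDF weight does not change the argument). It shows that repeating a document, which doubles its length, leaves its cosine similarity to the log unchanged:

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical texts, not taken from the question.
log = tf_vector("disk error on node seven disk error retry")
doc_a = tf_vector("disk error retry")
doc_a_twice = tf_vector("disk error retry " * 2)  # same content, twice as long

sim_a = cosine(log, doc_a)
sim_a_twice = cosine(log, doc_a_twice)

# Scaling a vector by a constant does not change its direction,
# so both similarities agree up to floating-point error.
print(sim_a, sim_a_twice)
```

Note that this only covers length differences that scale the vector. A longer document that introduces different terms points in a different direction, and its score will change accordingly.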

Oftentimes ln(TF) is used instead of raw TF scores as a normalization feature, because the difference between seeing a term 1 and 2 times is far more important than the difference between seeing a term 100 and 200 times; it also keeps excessive use of a term from dominating the vector and is typically much more robust.
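A small sketch of the log scaling idea follows. It assumes the common `1 + ln(TF)` variant rather than bare `ln(TF)`, since `ln(1)` is 0 and would zero out terms seen once; scikit-learn's `TfidfVectorizer(sublinear_tf=True)` applies the same `1 + log(tf)` replacement. The function name `sublinear_tf` here is just an illustrative helper:

```python
import math


def sublinear_tf(count):
    """Log-scaled term frequency: 1 + ln(count), or 0 for an absent term."""
    return 1.0 + math.log(count) if count > 0 else 0.0


# With raw counts, the gap between 100 and 200 occurrences is 100 times
# the gap between 1 and 2 occurrences.
raw_gap_low = 2 - 1
raw_gap_high = 200 - 100

# With log scaling, each doubling adds the same amount, ln(2), about 0.693.
gap_low = sublinear_tf(2) - sublinear_tf(1)
gap_high = sublinear_tf(200) - sublinear_tf(100)

print(raw_gap_low, raw_gap_high)
print(gap_low, gap_high)
```

Under this scaling a term repeated hundreds of times still counts for more than a term seen once, but it can no longer dominate the vector the way a raw count would.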

Related Question

tf-idf documents of different length

Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?

How to classify new documents with tf-idf?

scikit-learn - Should I fit model with TF or TF-IDF?

TF-IDF function

Interpreting the sum of TF-IDF scores of words across documents

Find the tf-idf score of specific words in documents using sklearn

tf-idf for large number of documents (>100k)

Calculating the TF-IDF of a query string over a trained set of documents

How should I go about using TF-IDF for text classification on the data I collected?




solution1 4 ACCEPTED 2017-06-17 03:12:37
