
TF-IDF: should I do normalization of document length?

When using TF-IDF to compare documents A and B, I know that document length is not important. But when comparing A-B and A-C, I think documents B and C should be the same length.

For example:
Log: 100 words
Document A: 20 words
Document B: 30 words

Log - A's TF-IDF score: 0.xx
Log - B's TF-IDF score: 0.xx

Should I do normalization of documents A and B? (If the comparison targets differ in length, it seems this could give a misleading or wrong result.)
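For context, a minimal sketch of the comparison described above, assuming scikit-learn is used (the question does not name a library); the document strings and word counts are placeholders, not the asker's data:

```python
# Sketch of the Log-vs-A and Log-vs-B comparison, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

log_text = "error timeout connection retry disk failure"  # ~100 words in practice
doc_a = "timeout retry network"                            # ~20 words
doc_b = "connection error disk failure retry"              # ~30 words

# Fit IDF on all texts, then compare the log against each document.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([log_text, doc_a, doc_b])

score_log_a = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
score_log_b = cosine_similarity(tfidf[0], tfidf[2])[0, 0]
print(f"Log - A: {score_log_a:.2f}, Log - B: {score_log_b:.2f}")
```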

Generally you want to do whatever gives you the best cross-validated results on your data.

If all you are doing to compare them is taking cosine similarity, then you have to normalize the vectors as part of the calculation, and because of that normalization the score won't be affected by varying document lengths. Many general document retrieval systems consider shorter documents to be more valuable, but this is typically handled as a score multiplier after the similarities have been calculated.
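A small numpy check of this point (my own illustration, not from the answer): scaling a TF-IDF vector up, as a longer document with the same term mix would roughly do, leaves the cosine score unchanged because the magnitudes cancel out.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector magnitudes,
    # which is exactly the normalization step mentioned above.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([0.2, 0.0, 0.7, 0.1])
doc   = np.array([0.1, 0.3, 0.5, 0.0])

print(cosine(query, doc))      # some value between 0 and 1
print(cosine(query, 3 * doc))  # identical value: the scale factor cancels
```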

Oftentimes ln(TF) is used instead of raw TF scores as a normalization feature, because the difference between seeing a term 1 time and 2 times is far more important than the difference between seeing it 100 times and 200 times; it also keeps excessive use of a single term from dominating the vector and is typically much more robust.
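A short sketch of that compression effect, using the common 1 + ln(tf) variant (e.g. what scikit-learn's sublinear_tf=True applies); the answer only says ln(TF), so treat the exact form as an assumption:

```python
import math

# Raw count vs. sublinearly scaled count.
for tf in (1, 2, 100, 200):
    print(tf, round(1 + math.log(tf), 2))

# 1 -> 1.0 and 2 -> 1.69: a noticeable gap,
# 100 -> 5.61 and 200 -> 6.3: a much smaller relative gap.
```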
