
(Text Classification) Handling the same words from different documents [TF-IDF]

Originally posted 2014-03-03 22:58:00 · Tags: python, text, machine-learning, classification, tf-idf

I'm writing a Python class that calculates the TF-IDF weight of each word in a document. My dataset has 50 documents. Many words appear in more than one document, so the same word shows up as a feature several times, each time with a different TF-IDF weight. How do I combine all of these weights into one single weight?

1 Answer

First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is an occurrence of a term at a particular location in a particular document. Multiple tokens can use the same term. For example, this answer contains many tokens that use the term "the", but there is only one term "the".
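To make the distinction concrete, here is a minimal sketch. The sample sentence and the whitespace tokenizer are assumptions chosen for illustration; a real pipeline would likely normalize case and punctuation as well.

```python
from collections import Counter

# Hypothetical one-sentence "document", split on whitespace (an assumed tokenizer).
text = "the cat sat on the mat near the door"

tokens = text.split()            # every occurrence counts: these are tokens
terms = set(tokens)              # distinct word-like units: these are terms
tokens_per_term = Counter(tokens)  # how many tokens use each term

print(len(tokens), "tokens,", len(terms), "terms")
print("tokens using the term 'the':", tokens_per_term["the"])
```

The sentence has several tokens that use "the", but "the" appears only once in the set of terms.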

I think you are a little bit confused. TF-IDF-style weighting functions specify how to compute a score for each term in a document from two quantities: the term's token frequency in that document and its background document frequency in the corpus. TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the weight for that term, but there will only be one weight per term. There is no separate score for each token that shares a term inside the document.
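The mapping described above can be sketched as follows. This is only one common variant (raw token count times log(N / df)), not necessarily the formula the asker's class uses. The function name `tfidf`, the whitespace tokenizer, and the three sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumed tokenizer: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # token frequency of each term in this document
        weights.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Made-up sample corpus.
docs = ["the cat sat on the mat", "the dog chased the cat", "a dog barked"]
weights = tfidf(docs)

for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

In the first document the two "the" tokens collapse into a single entry whose weight reflects the count of 2. The term "cat" appears in two documents and gets one weight in each document's mapping; those per-document weights stay separate because each mapping describes a different document.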