
How do you include all words from the corpus in a Gensim TF-IDF?

If I have some documents like this:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

And I compute a TF-IDF matrix for this in Gensim like this:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create bow corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf-idf model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then for each document, I get a TF-IDF vector like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want the TF-IDF vector for each document to include words with zero TF-IDF values (i.e. include every word mentioned in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

How can I do this in Gensim? Or maybe there is some other library that can compute a TF-IDF matrix in this fashion? (Like Gensim, it needs to be able to handle very large data sets; e.g. I achieved this result with scikit-learn on a small data set, but scikit-learn runs into memory problems on large data sets.)

The easiest way, in my opinion, is to use the word list from nltk.corpus. But first, you need to install nltk, which can be done easily using pip or conda:

  • Using pip:
pip install nltk
  • Using conda:
conda install -c anaconda nltk

Now, you can change your dictionary to be like so:

from nltk.corpus import words  # requires the word list data: nltk.download('words')

# build the dictionary from nltk's full English word list (one copy per document)
dictionary = corpora.Dictionary([words.words()]*len(documents))

Now your dictionary has more than 235,000 words.
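
To actually get the zero-padded vectors from this, you would rebuild the bow corpus against the enlarged dictionary and densify the model's output. A minimal sketch, assuming the `documents` list from the question and the nltk-based `dictionary` above, using gensim's matutils.corpus2dense to fill in the zeros:

from gensim import models, matutils
from gensim.utils import simple_preprocess

corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
tfidf = models.TfidfModel(corpus, smartirs='ntc')
# result has shape (num_terms, num_docs); words absent from a document become 0.0
dense = matutils.corpus2dense(tfidf[corpus], num_terms=len(dictionary))

Each column of dense is then the full TF-IDF vector for one document over all dictionary words.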

You can use sklearn.TfidfVectorizer to do that. It can be done in just four lines, like so:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221
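
One caveat for very large data sets: X itself is a sparse matrix; it is the .toarray() call that materializes the full dense matrix and is the usual source of the memory problems mentioned in the question. If you only need one document at a time, you can densify row by row, for example:

>>> row0 = X.getrow(0).toarray().ravel()  # dense TF-IDF vector for doc 0 only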

EDIT

You can convert the tf-idf matrix back to a gensim corpus using Sparse2Corpus, like so:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)
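
To sanity-check the conversion, you can map sklearn's feature indices back to words; a quick sketch using vectorizer.vocabulary_ (sklearn's word-to-index dict):

>>> id2word = {i: w for w, i in vectorizer.vocabulary_.items()}
>>> [(id2word[i], w) for i, w in next(iter(tfidf_mat))]  # first document's nonzero terms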

Hope this helps
