
How do you include all words from the corpus in a Gensim TF-IDF?

If I have some documents like this:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

And I compute a TF-IDF matrix for this in Gensim like this:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create bow corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf-idf model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then for each document, I get a TF-IDF vector like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want the TF-IDF vector for each document to include words with zero TF-IDF values (i.e. include every word mentioned in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

How can I do this in Gensim? Or maybe there is some other library that can compute a TF-IDF matrix in this fashion? (Like Gensim, it needs to be able to handle very large data sets; e.g. I achieved this result with scikit-learn on a small data set, but scikit-learn runs into memory problems on large data sets.)

The easiest way, in my opinion, is to use the word list from nltk.corpus. But first, you need to install nltk, which can be done easily using pip or conda:

  • Using pip:
pip install nltk
  • Using conda:
conda install -c anaconda nltk

Now, you can change your dictionary to be like so:

from nltk.corpus import words  # requires the word list data: nltk.download('words')

# build the dictionary from nltk's full English word list (one copy per document)
dictionary = corpora.Dictionary([words.words()]*len(documents))

Now your dictionary has more than 235,000 words.
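
To actually get the zero-padded vectors from this, you would rebuild the bow corpus against the enlarged dictionary and densify the model's output. A minimal sketch, assuming the `documents` list from the question and the nltk-based `dictionary` above, using gensim's matutils.corpus2dense to fill in the zeros:

from gensim import models, matutils
from gensim.utils import simple_preprocess

corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
tfidf = models.TfidfModel(corpus, smartirs='ntc')
# result has shape (num_terms, num_docs); words absent from a document become 0.0
dense = matutils.corpus2dense(tfidf[corpus], num_terms=len(dictionary))

Each column of dense is then the full TF-IDF vector for one document over all dictionary words.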

You can use sklearn.TfidfVectorizer to do that. It can be done in just four lines, like so:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221
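
One caveat for very large data sets: X itself is a sparse matrix; it is the .toarray() call that materializes the full dense matrix and is the usual source of the memory problems mentioned in the question. If you only need one document at a time, you can densify row by row, for example:

>>> row0 = X.getrow(0).toarray().ravel()  # dense TF-IDF vector for doc 0 only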

EDIT

You can convert the tf-idf matrix back to a gensim corpus using Sparse2Corpus, like so:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)
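
To sanity-check the conversion, you can map sklearn's feature indices back to words; a quick sketch using vectorizer.vocabulary_ (sklearn's word-to-index dict):

>>> id2word = {i: w for w, i in vectorizer.vocabulary_.items()}
>>> [(id2word[i], w) for i, w in next(iter(tfidf_mat))]  # first document's nonzero terms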

Hope this helps
