How do you include all words from the corpus in a Gensim TF-IDF?

Question

Suppose I have some documents like these:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

I compute a TF-IDF matrix for them in Gensim like this:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create bow corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf.idf matrix
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then, for each document, I get a TF-IDF vector like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want each document's TF-IDF vector to also contain the words whose TF-IDF value is 0 (that is, to include every word mentioned in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

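How can I do this in Gensim? Or is there perhaps another library that can compute a TF-IDF matrix this way? Like Gensim, it would need to handle very large datasets. For example, I got this result with scikit-learn on a small dataset, but scikit-learn runs into memory problems on large ones.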

Answer 1

In my opinion, the simplest approach is to use the words corpus from nltk.corpus. But first you need to install nltk, which is easy to do with pip or conda:

Using pip:

pip install nltk

Using conda:

conda install -c anaconda nltk

Now you can change your dictionary to:

import nltk
from nltk.corpus import words

nltk.download('words')  # one-time download of the word list
dictionary = corpora.Dictionary([words.words()]*len(documents))

Now your dictionary has more than 235,000 words.

Answer 2

You can do this with sklearn's TfidfVectorizer. It takes just four lines, as shown below:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
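>>> # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())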
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221

Edit

You can convert the tfidf matrix back to a gensim corpus with Sparse2Corpus, as follows:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)

Hope this helps.

Answer 1
0 2019-11-21 06:19:28

Answer 2
0 2019-11-21 16:26:17
