
How do you include all words from the corpus in a Gensim TF-IDF?

If I have some documents like this:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

And I compute a TF-IDF matrix for this in Gensim like this:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create the dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create the bag-of-words corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf-idf model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then for each document, I get a TF-IDF like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want the TF-IDF vector for each document to include words with 0 TF-IDF values (ie include every word mentioned in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

How can I do this in Gensim? Or is there some other library that can compute a TF-IDF matrix in this fashion? Like Gensim, it needs to be able to handle very large data sets: I achieved this result with scikit-learn on a small data set, but scikit-learn runs out of memory on a large one.
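For reference, the padding step itself is simple: given the sparse (term_id, weight) pairs Gensim returns and the dictionary size, fill every missing id with zero (Gensim also ships an equivalent helper, gensim.matutils.sparse2full, which returns a NumPy array). A minimal pure-Python sketch of the idea, using a made-up sparse vector over a hypothetical 8-word dictionary:

```python
def densify(sparse_vec, num_terms):
    """Expand a sparse [(term_id, weight), ...] vector to a dense list
    of length num_terms, with 0.0 for every term not in the document."""
    dense = [0.0] * num_terms
    for term_id, weight in sparse_vec:
        dense[term_id] = weight
    return dense

# Hypothetical sparse TF-IDF vector over an 8-word dictionary:
vec = [(0, 0.5), (3, 0.25), (4, 0.25)]
print(densify(vec, 8))  # [0.5, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.0]
```

With a real Gensim model you would pass `tfidf[bow_vec]` and `len(dictionary)` in place of the toy values.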

The easiest way, in my opinion, is to use the word list from nltk.corpus. But first, you need to install nltk, which can be done easily using pip or conda:

  • Using pip:
pip install nltk
  • Using conda:
conda install -c anaconda nltk

Now, you can change your dictionary to be like so:

from nltk.corpus import words

dictionary = corpora.Dictionary([words.words()]*len(documents))

Now your dictionary contains more than 235,000 words.

You can use sklearn.TfidfVectorizer to do that. It can be done in just four lines like so:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221
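If TfidfVectorizer runs out of memory on your large data set, one standard workaround in scikit-learn is HashingVectorizer, which maps terms to a fixed-width feature space instead of storing a vocabulary, so every document vector has the same length by construction (the trade-off is that you lose the id-to-word mapping). A sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

corpus = ["hello hello this is a document", "this text is very interesting"]

# norm=None and alternate_sign=False give raw term counts in a fixed-width
# hashed space; TfidfTransformer then applies the IDF weighting.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
counts = hasher.transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (2, 1024) — every document has the same fixed width
```

Because the matrix stays sparse and the vectorizer is stateless, this scales to corpora that do not fit in memory (e.g. streamed in batches with `transform`).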

EDIT

You can convert the tf-idf matrix back to a Gensim corpus using Sparse2Corpus like so:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)

Hope this helps
