
How do you include all words from the corpus in a Gensim TF-IDF?

If I have some documents like this:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

And I compute a TF-IDF matrix for this in Gensim like this:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create the dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create the bag-of-words corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf-idf model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then for each document, I get a TF-IDF like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want the TF-IDF vector for each document to include words with 0 TF-IDF values (ie include every word mentioned in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

How can I do this in Gensim? Or is there some other library that can compute a TF-IDF matrix in this fashion? Like Gensim, it needs to be able to handle very large data sets: I achieved this result with scikit-learn on a small data set, but scikit-learn runs out of memory on a large one.
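For reference, the padding step itself is simple: given the sparse (term_id, weight) pairs Gensim returns and the dictionary size, fill every missing id with zero (Gensim also ships an equivalent helper, gensim.matutils.sparse2full, which returns a NumPy array). A minimal pure-Python sketch of the idea, using a made-up sparse vector over a hypothetical 8-word dictionary:

```python
def densify(sparse_vec, num_terms):
    """Expand a sparse [(term_id, weight), ...] vector to a dense list
    of length num_terms, with 0.0 for every term not in the document."""
    dense = [0.0] * num_terms
    for term_id, weight in sparse_vec:
        dense[term_id] = weight
    return dense

# Hypothetical sparse TF-IDF vector over an 8-word dictionary:
vec = [(0, 0.5), (3, 0.25), (4, 0.25)]
print(densify(vec, 8))  # [0.5, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.0]
```

With a real Gensim model you would pass `tfidf[bow_vec]` and `len(dictionary)` in place of the toy values.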

The easiest way, in my opinion, is to use the word list from nltk.corpus. But first, you need to install nltk, which can be done easily using pip or conda:

  • Using pip:
pip install nltk
  • Using conda:
conda install -c anaconda nltk

Now, you can change your dictionary to be like so:

from nltk.corpus import words

dictionary = corpora.Dictionary([words.words()]*len(documents))

Now your dictionary contains more than 235,000 words.

You can use sklearn.TfidfVectorizer to do that. It can be done in just four lines like so:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221
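If TfidfVectorizer runs out of memory on your large data set, one standard workaround in scikit-learn is HashingVectorizer, which maps terms to a fixed-width feature space instead of storing a vocabulary, so every document vector has the same length by construction (the trade-off is that you lose the id-to-word mapping). A sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

corpus = ["hello hello this is a document", "this text is very interesting"]

# norm=None and alternate_sign=False give raw term counts in a fixed-width
# hashed space; TfidfTransformer then applies the IDF weighting.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
counts = hasher.transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (2, 1024) — every document has the same fixed width
```

Because the matrix stays sparse and the vectorizer is stateless, this scales to corpora that do not fit in memory (e.g. streamed in batches with `transform`).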

EDIT

You can convert the tf-idf matrix back to a Gensim corpus using Sparse2Corpus like so:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)

Hope this helps
