If I have some documents like this:
doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]
And I compute a TF-IDF matrix for this in Gensim like this:
# create dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create bow corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# create the tf.idf matrix
tfidf = models.TfidfModel(corpus, smartirs='ntc')
Then for each document, I get a TF-IDF like this:
Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]
But I want the TF-IDF vector for each document to include words with 0 TF-IDF values (i.e. include every word mentioned anywhere in the corpus):
Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]
How can I do this in Gensim? Or is there some other library that can compute a TF-IDF matrix in this fashion? Like Gensim, it would need to handle very large data sets; e.g. I achieved this result with scikit-learn on a small data set, but scikit-learn ran into memory problems on a large one.
The easiest way, in my opinion, is to use the nltk.corpus words list. But first, you need to install nltk, which can be done easily with pip or conda:

pip: pip install nltk
conda: conda install -c anaconda nltk
Now, download the word list once with nltk.download('words'), then change your dictionary to be like so:
from nltk.corpus import words
dictionary = corpora.Dictionary([words.words()]*len(documents))
Now your dictionary has more than 235,000 words, so every document vector can include an entry for words that never appear in that document.
You can use sklearn's TfidfVectorizer
to do that. It can be done in a few lines like so:
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
>>> df
document hello interesting is text this very
0 0.407824 0.815648 0.000000 0.29017 0.000000 0.29017 0.000000
1 0.000000 0.000000 0.499221 0.35520 0.499221 0.35520 0.499221
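If memory on large data sets is the concern (as mentioned in the question), scikit-learn's HashingVectorizer combined with TfidfTransformer avoids holding a vocabulary in memory. A sketch, not part of the original answer:

```python
# Sketch: memory-friendly TF-IDF with a hashing trick instead of a vocabulary.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

corpus = ["hello hello this is a document", "this text is very interesting"]

# HashingVectorizer stores no vocabulary, so it scales to streamed corpora;
# the trade-off is that feature indices cannot be mapped back to words.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
counts = hasher.transform(corpus)
X = TfidfTransformer().fit_transform(counts)
print(X.shape)
```

Every document vector now has the same fixed width (here 1024), with zeros for absent features, which matches the "include every word" shape asked for.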
You can convert the tfidf matrix back to a gensim corpus using matutils.Sparse2Corpus
like so:
>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)
Hope this helps