
How to compute TF-IDF on a specific dataset

I have a dataset of articles. Most online examples hard-code the corpus. How can I calculate the TF-IDF of my own dataset?

Note: I created a DataFrame to store the data. Here is my code:

# Run once in a shell first: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = merged_df['title']

vectorizer = CountVectorizer()
wordFrequency = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = vectorizer.get_feature_names_out()

print(word)


#-----------------------

from sklearn.feature_extraction.text import TfidfTransformer 

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordFrequency)

You could try TfidfVectorizer instead of CountVectorizer; it combines the counting and TF-IDF steps:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5)  # min_df sets the minimum document frequency; optional
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

print(word)

Basically, just change the Vectorizer you use. Cheers!
