I have a dataset of articles. Most online examples hard-code the corpus. If I want to calculate TF-IDF on my own dataset, what should I do?
Note: I store the data in a pandas DataFrame. Here is my code:
pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
corpus = merged_df['title']
vectorizer = CountVectorizer()
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(word)
#-----------------------
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordFrequency)
You could try TfidfVectorizer instead of CountVectorizer; it does the counting and the TF-IDF weighting in one step:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)  # min_df=5 ignores terms appearing in fewer than 5 documents; optional
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(word)
Basically, just change the Vectorizer you use. Cheers!