
How to compute TF-IDF on a specific dataset

I have a dataset of articles. Most online examples hard-code the corpus. How can I calculate the TF-IDF of my own dataset?

Note: I created a DataFrame to store the data. Here is my code:

# Run once in a shell first: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = merged_df['title']

vectorizer = CountVectorizer()
wordFrequency = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = vectorizer.get_feature_names_out()

print(word)


#-----------------------

from sklearn.feature_extraction.text import TfidfTransformer 

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordFrequency)

You could try TfidfVectorizer instead of CountVectorizer; it combines the counting and TF-IDF steps:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5)  # min_df sets the minimum document frequency; optional
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

print(word)

Basically, just change the Vectorizer you use. Cheers!
