from sklearn.feature_extraction.text import TfidfVectorizer
item = list(df['item1']) + list(df['item2'])
tfidf = TfidfVectorizer()
tfidf_sp = tfidf.fit_transform(item)
for i in len(list(df['item1'])):
new_list =[]
new_list.append(tfidf.idf_)
df['updated_item'] = list(new_list)
I am trying to add the TF-IDF scores as features. Is this the correct way to do it?
item1 has shape (400k,), and item2 has the same shape. tfidf_sp has shape (800k, 100k).
import pandas as pd
pd.DataFrame(tfidf_sp, columns=tfidf.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0
This gives you a DataFrame whose columns are the terms in the TF-IDF vocabulary and whose rows hold the TF-IDF values for each item.
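For context on why the loop in the question does not produce per-item scores: `tfidf.idf_` holds one inverse-document-frequency weight per vocabulary term, shared by every document, while the matrix returned by `fit_transform` holds one row of scores per document. A minimal sketch on a made-up three-document corpus (the `docs` list is purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real items.
docs = ["red apple", "green apple", "green pear"]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)  # sparse matrix of per-document scores

# idf_ is a 1-D array with one weight per vocabulary term.
print(tfidf.idf_.shape)   # (4,)

# The transformed matrix has one row per document, one column per term.
print(scores.shape)       # (3, 4)
```

So the per-item features are the rows of `scores`, not `idf_`.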
Hope this helps.
Edit:
Try converting the resulting document-term matrix into a dense array as follows:
tfidf_sp = tfidf.fit_transform(item).toarray()
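This resolves the pandas error, because pd.DataFrame expects a dense array rather than a SciPy sparse matrix. Be aware that a dense float64 array of shape (800k, 100k), as in the question, would need roughly 640 GB of memory, so this only works on much smaller data.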
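If the dense array does not fit in memory, one option is to keep everything sparse. The following is only a sketch under assumptions: the toy `df` is invented, and stacking the item1 and item2 blocks side by side is one possible way to get a single feature row per DataFrame row, not necessarily what the asker needs. It relies on the first half of `tfidf_sp` corresponding to item1 and the second half to item2, which follows from how `item` was built.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy stand-in for the question's DataFrame.
df = pd.DataFrame({
    "item1": ["red apple", "green pear", "blue plum"],
    "item2": ["apple pie", "pear tart", "plum jam"],
})

n = len(df)
item = list(df["item1"]) + list(df["item2"])

tfidf = TfidfVectorizer()
tfidf_sp = tfidf.fit_transform(item)  # sparse, shape (2 * n, vocab_size)

# Rows 0..n-1 came from item1, rows n..2n-1 from item2.
item1_features = tfidf_sp[:n]
item2_features = tfidf_sp[n:]

# One row per original DataFrame row: item1 and item2 scores side by side.
features = hstack([item1_features, item2_features]).tocsr()

# Optional: a sparse-backed DataFrame, built without densifying.
# Column names are the vocabulary terms ordered by column index.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
item1_df = pd.DataFrame.sparse.from_spmatrix(item1_features, columns=terms)

print(features.shape)
print(item1_df.shape)
```

Most scikit-learn estimators accept a sparse matrix like `features` directly, so the DataFrame step is often unnecessary.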