
Add tf-idf values as columns in a matrix

from sklearn.feature_extraction.text import TfidfVectorizer

item = list(df['item1']) + list(df['item2'])
tfidf = TfidfVectorizer()
tfidf_sp = tfidf.fit_transform(item)

for i in len(list(df['item1'])):
    new_list =[]
    new_list.append(tfidf.idf_)
df['updated_item'] = list(new_list)

I am trying to add the tf-idf scores as features. Is this the correct way to do it?

item1 has shape (400k,) and item2 has the same shape. tfidf_sp has shape (800k, 100k).

import pandas as pd

# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(tfidf_sp, columns=tfidf.get_feature_names_out())

This gives you a DataFrame whose columns are the terms in the tf-idf vocabulary and whose rows hold the tf-idf values for each item.
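Because item was built by concatenating item1 and item2, the first len(df) rows of the result belong to item1 and the remaining rows to item2. The sketch below shows that split on a tiny made-up DataFrame (the toy rows and the names n, item1_tfidf and item2_tfidf are assumptions for illustration, not from the question). It also shows why appending tfidf.idf_ in a loop does not give per-row features: idf_ holds one global weight per vocabulary term, not one vector per document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real 400k-row df
df = pd.DataFrame({
    "item1": ["red apple", "green pear"],
    "item2": ["apple pie", "pear tart"],
})

# Same construction as in the question: item1 rows first, then item2 rows
item = list(df["item1"]) + list(df["item2"])
tfidf = TfidfVectorizer()
tfidf_sp = tfidf.fit_transform(item)  # sparse matrix, shape (2 * len(df), n_terms)

# Slice the stacked matrix back into one block per original column
n = len(df)
item1_tfidf = tfidf_sp[:n]  # row i corresponds to df["item1"][i]
item2_tfidf = tfidf_sp[n:]  # row i corresponds to df["item2"][i]

# idf_ is a single vector with one entry per vocabulary term
print(tfidf_sp.shape, item1_tfidf.shape, item2_tfidf.shape, tfidf.idf_.shape)
```

Keeping the two blocks as sparse matrices is usually enough if they are going straight into a model; a DataFrame is only needed when you want labelled columns.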

Hope this helps.

Edit:

Try converting the term-document matrix obtained into an array as follows:

tfidf_sp = tfidf.fit_transform(item).toarray()

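This will solve the pandas error, because the DataFrame constructor does not unpack a scipy sparse matrix into columns. Be aware that a dense float64 array of shape (800k, 100k) needs roughly 640 GB of memory, so at that size this only works on a smaller sample or a reduced vocabulary.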
