
Get the top term per document - scikit tf-idf

After vectorizing multiple documents with scikit-learn's tf-idf vectorizer, is there a way to get the most 'influential' term per document?

I have only found ways of getting the most 'influential' terms for the entire corpus, though, not for each individual document.

Just adding one more way of doing this, replacing the last two steps of Ami's answer:

# Map each document (row) to the term with the highest tf-idf weight
# (on scikit-learn >= 1.0, use count_vect.get_feature_names_out() instead)
feature_names = np.array(count_vect.get_feature_names())
feature_names[np.asarray(X_train_tfidf.argmax(axis=1)).ravel()]
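Run after the fitting steps below, that last line yields one term per document. A quick sanity check might look like this (a sketch; the actual terms depend on the corpus):

top_terms = feature_names[np.asarray(X_train_tfidf.argmax(axis=1)).ravel()]
print(top_terms.shape)  # one top term per training document
print(top_terms[:5])    # top tf-idf terms for the first five documents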

Say you start with a dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from sklearn.datasets import fetch_20newsgroups

d = fetch_20newsgroups()
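Here d.data is a plain list of raw post texts, one string per document (the default train subset of 20 newsgroups):

print(len(d.data))      # 11314 posts in the default train split
print(type(d.data[0]))  # each document is a raw str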

Use a count vectorizer and tfidf:

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(d.data)
transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(X_train_counts)
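(As an aside, not part of the original answer: CountVectorizer followed by TfidfTransformer with default settings is equivalent to TfidfVectorizer, so both fitting steps can be collapsed into one.)

from sklearn.feature_extraction.text import TfidfVectorizer

# One-step equivalent of the CountVectorizer + TfidfTransformer pipeline above
tfidf_vect = TfidfVectorizer()
X_alt = tfidf_vect.fit_transform(d.data)  # should match X_train_tfidf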

Now you can create an inverse mapping:

m = {v: k for (k, v) in count_vect.vocabulary_.items()}
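The same column-index-to-term mapping is also available directly from the vectorizer as an array, which saves building the dict by hand (get_feature_names_out needs scikit-learn >= 1.0):

# Vocabulary in column order: feature_names[i] == m[i]
feature_names = count_vect.get_feature_names_out()
assert all(feature_names[v] == k for (k, v) in count_vect.vocabulary_.items())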

and this gives the most influential word per document:

[m[t] for t in np.array(np.argmax(X_train_tfidf, axis=1)).flatten()]
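If you want the k most influential terms per document rather than only the top one, here is a minimal sketch (top_k_terms is a hypothetical helper, not a scikit-learn function):

def top_k_terms(tfidf_row, index_to_term, k=3):
    # Densify the single document row and sort its tf-idf weights descending
    row = np.asarray(tfidf_row.todense()).ravel()
    top = np.argsort(row)[::-1][:k]
    # Skip zero-weight columns so short documents don't pad with arbitrary terms
    return [index_to_term[i] for i in top if row[i] > 0]

print(top_k_terms(X_train_tfidf[0], m, k=3))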
