Understanding the top n tfidf features in TfidfVectorizer

Question

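I am trying to better understand scikit-learn's TfidfVectorizer. The code below has two documents, doc1 = The car is driven on the road and doc2 = The truck is driven on the highway. Calling fit_transform generates a vectorized matrix of tf-idf weights.

According to the tf-idf value matrix, shouldn't highway, truck, car be the top words rather than highway, truck, driven, given that highway = truck = car = 0.63 and driven = 0.44?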

#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)

feature_array = np.array(vectorizer.get_feature_names()) #list of features
print(feature_array)
print(response.toarray())

sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)

#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)

['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672  0.44943642 0.         0.6316672  0.        ]
 [0.         0.44943642 0.6316672  0.         0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']

Answer 1

As the result shows, the tf-idf matrix does give higher scores to highway, truck, car (and road):

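import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(tn)
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
terms = vectorizer.get_feature_names()

pd.DataFrame(response.toarray(), columns=terms)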

        car    driven   highway      road     truck
0  0.631667  0.449436  0.000000  0.631667  0.000000
1  0.000000  0.449436  0.631667  0.000000  0.631667
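The 0.63 and 0.45 values in this table can be reproduced by hand, which may help explain why driven scores lower than the other words: it appears in both documents, so its idf is smaller. The following is a minimal sketch, assuming TfidfVectorizer's default settings (smooth_idf=True, norm='l2', raw term counts); the variable names are illustrative and not from the original post.

```python
import numpy as np

# Term counts after stop-word removal; columns: car, driven, highway, road, truck
counts = np.array([
    [1, 1, 0, 1, 0],   # "The car is driven on the road"
    [0, 1, 1, 0, 1],   # "The truck is driven on the highway"
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed idf (scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalise each row

print(tfidf.round(6))
```

Under these assumptions, driven has idf 1.0 while the words that occur in only one document have idf ln(1.5) + 1, roughly 1.405, and the row normalisation then yields about 0.449 and 0.632.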

The problem is in the further check you do by flattening the array. To get the top scores across all rows, you can do the following instead:

max_scores = response.toarray().max(0).argsort()
np.array(terms)[max_scores[-4:]]
array(['car', 'highway', 'road', 'truck'], dtype='<U7')

Here the top scores belong to the feature names that have a score of 0.63 in the dataframe.
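If the goal is instead the top n terms for each document separately, one option is to sort along the column axis rather than flattening. This is a sketch rather than part of the original answer; the names scores, top_idx and top_terms_per_doc are made up for illustration, and n = 2 is chosen because each document here has exactly two terms tied at the highest score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words='english')
scores = vectorizer.fit_transform(tn).toarray()

# get_feature_names_out() exists in scikit-learn >= 1.0; fall back on older versions
if hasattr(vectorizer, 'get_feature_names_out'):
    terms = np.asarray(vectorizer.get_feature_names_out())
else:
    terms = np.asarray(vectorizer.get_feature_names())

n = 2
# Sort each row independently (axis=1), reverse to descending order, keep n columns
top_idx = np.argsort(scores, axis=1)[:, ::-1][:, :n]

# Tied scores may come back in either order, so sort the names for a stable result
top_terms_per_doc = [sorted(str(t) for t in terms[row]) for row in top_idx]
print(top_terms_per_doc)
```

For the two documents above this gives car and road for the first document and highway and truck for the second. Flattening the argsort output, as in the question, mixes the column indices of both rows into one sequence, so slicing it no longer corresponds to a ranking of scores.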

Answer 1 accepted 2020-05-06 21:36:10
