How to get the top n terms with the highest tf-idf score - big sparse matrix

Question

There is this code:

feature_array = np.array(tfidf.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

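taken from this answer.

My question is: how can I do this efficiently when my sparse matrix is too large to convert to a dense matrix all at once (with response.toarray())?

Obviously, the general answer is to split the sparse matrix into chunks, convert each chunk inside a for loop, and then combine the results from all the chunks.

But I would like to see the specific code that does this.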

Answer 1

If you look closely at that question, they are interested in the top tf_idf scores for a single document.
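For the single-document case, the dense conversion can be avoided entirely, because a CSR matrix already stores each row's non-zero scores and their column indices. The following is only a minimal sketch under that assumption; `top_n_for_row` and the toy matrix are hypothetical names and data made up for illustration, not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_for_row(X, row, feature_names, n=3):
    """Return the n highest-scoring (term, score) pairs for one row of a sparse matrix."""
    X = X.tocsr()                                   # CSR gives direct access to row slices
    start, end = X.indptr[row], X.indptr[row + 1]   # where this row's entries are stored
    scores = X.data[start:end]                      # non-zero scores of this document
    cols = X.indices[start:end]                     # column (feature) index of each score
    order = np.argsort(scores)[::-1][:n]            # positions of the n largest scores
    return [(str(feature_names[cols[i]]), float(scores[i])) for i in order]

# Toy data standing in for a real tf-idf matrix and its vocabulary
feature_names = np.array(['aim', 'check', 'document', 'like'])
X = csr_matrix(np.array([
    [0.0, 0.6, 0.5, 0.7],
    [0.0, 0.0, 1.0, 0.0],
    [0.9, 0.0, 0.0, 0.1],
]))

print(top_n_for_row(X, 2, feature_names, n=2))
```

Only the non-zero entries of the requested row are sorted, so the cost depends on the number of terms in that document rather than on the vocabulary size.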

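When you want to do the same thing for a large corpus, you need to sum the scores of each feature across all documents (this is still not very meaningful, because TfidfVectorizer() l2-normalizes its scores by default). I would suggest using the .idf_ scores to find the features with a high inverse document frequency.

If you want to find the top features based on the number of occurrences, use CountVectorizer().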

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_array = vectorizer.get_feature_names_out()

top_n = 3

# Sum each feature's tf-idf score over all documents; the result is only 1 x n_features
print('tf_idf scores: \n', sorted(zip(feature_array, X.sum(0).getA1()),
                                  key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores: 
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]

# idf_ holds one inverse document frequency value per feature
print('idf values: \n', sorted(zip(feature_array, vectorizer.idf_),
                               key=lambda x: x[1], reverse=True)[:top_n])

# idf values: 
#  [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]

# Raw term counts instead of tf-idf weights
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()
print('Frequency: \n', sorted(zip(feature_array, X.sum(0).getA1()),
                              key=lambda x: x[1], reverse=True)[:top_n])

# Frequency: 
#  [('document', 2), ('aim', 1), ('capture', 1)]
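
If the top terms are needed for every document rather than for the corpus as a whole, the chunked approach described in the question could look roughly like the sketch below. It is an illustration only: `top_n_per_row_chunked`, `chunk_size`, and the toy matrix are hypothetical, and the chunk size would have to be chosen so that a dense block of `chunk_size x n_features` values fits in memory.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_per_row_chunked(X, feature_names, n=3, chunk_size=1000):
    """Top-n (term, score) pairs per row, densifying only chunk_size rows at a time."""
    X = X.tocsr()                                    # row slicing is cheap on CSR
    results = []
    for start in range(0, X.shape[0], chunk_size):
        dense = X[start:start + chunk_size].toarray()         # small dense block
        order = np.argsort(dense, axis=1)[:, ::-1][:, :n]     # n largest columns per row
        for r, row_order in enumerate(order):
            # Skip zero scores so short documents do not report absent terms
            results.append([(str(feature_names[j]), float(dense[r, j]))
                            for j in row_order if dense[r, j] > 0])
    return results

# Toy data standing in for a real tf-idf matrix and its vocabulary
feature_names = np.array(['aim', 'check', 'document', 'like'])
X = csr_matrix(np.array([
    [0.0, 0.6, 0.5, 0.7],
    [0.0, 0.0, 1.0, 0.0],
    [0.9, 0.0, 0.0, 0.1],
]))

top_terms = top_n_per_row_chunked(X, feature_names, n=2, chunk_size=2)
for doc_id, terms in enumerate(top_terms):
    print(doc_id, terms)
```

Each chunk is converted, ranked, and discarded before the next one is loaded, so peak memory is bounded by one dense block plus the collected results.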

Solution 1
7 Accepted 2019-06-22 06:13:55