Find the words with specified tf-idf scores
How can I get the words with the lowest tf-idf scores out of all the words?
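from sklearn.feature_extraction.text import TfidfVectorizer
# clean (a custom analyzer) and df (a DataFrame with a 'text' column) are defined elsewhere and not shown
tfidf_vect = TfidfVectorizer(analyzer=clean)
print(tfidf_vect.fit_transform(df['text']))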
[output]
(0, 11046) 0.1907144678156909
(0, 4791) 0.3125060892887963
(0, 7026) 0.15156899671911586
(0, 1534) 0.3125060892887963
...
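I want to list the words whose scores are below 0.1, together with their indices. I know I am working with a csr_matrix, and I converted it to an array to make it easier to handle, but I could not get it to work.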
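One possible stumbling block with the dense-array route is that words absent from a document get a score of 0, which also satisfies "below 0.1". The sketch below illustrates this on a small hypothetical score matrix (the values are made up and merely stand in for the `.toarray()` result of the vectorizer output); it is an illustration of the pitfall, not a diagnosis of the asker's actual code.

```python
import numpy as np

# Hypothetical dense scores: rows are documents, columns are vocabulary indices.
# These numbers are invented for illustration only.
scores = np.array([
    [0.0, 0.3, 0.0, 0.05],
    [0.08, 0.0, 0.6, 0.0],
])

# A bare comparison also matches the zeros, i.e. words that do not occur
# in the document at all.
naive = np.argwhere(scores < 0.1)

# Requiring a positive score keeps only words that actually occur.
present = np.argwhere((scores > 0) & (scores < 0.1))

print(f'naive matches:   {naive.tolist()}')
print(f'present matches: {present.tolist()}')
```

Each pair in `present` is a (row, column) position, the same kind of index pair that appears in the printed sparse output above.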
The simplest approach I can think of is to filter with plain numpy functions and then convert the result back to a sparse matrix if necessary.
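import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
threshold = 0.5
df = pd.DataFrame({'text': corpus})
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Densify once, then locate every entry above the threshold
dense = X.toarray()
rows, columns = np.where(dense > threshold)
print(f'Rows: {rows.tolist()}\nColumns: {columns.tolist()}')

# Rebuild a sparse matrix that holds only the filtered entries
filtered_sparse_matrix = csr_matrix((dense[dense > threshold], (rows, columns)), X.shape)
print(f'Final Matrix:\n{filtered_sparse_matrix}')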
Output:
Rows: [0, 1, 1, 2, 2, 2, 3]
Columns: [2, 1, 5, 0, 4, 7, 2]
Final Matrix:
(0, 2) 0.5802858236844359
(1, 1) 0.6876235979836938
(1, 5) 0.5386476208856763
(2, 0) 0.511848512707169
(2, 4) 0.511848512707169
(2, 7) 0.511848512707169
(3, 2) 0.5802858236844359