How can I get the words having the lowest tf-idf scores out of all the words?
tfidf_vect = TfidfVectorizer(analyzer=clean)
print(tfidf_vect.fit_transform(df['text']))
[output]
(0, 11046) 0.1907144678156909
(0, 4791) 0.3125060892887963
(0, 7026) 0.15156899671911586
(0, 1534) 0.3125060892887963
...
I want to get, say, the words with a score below 0.1, along with their indices. I know I am working with a csr_matrix, and I tried converting it to an array to make it easier to work with, but I couldn't get it to work.
The easiest way that comes to mind is to filter with plain NumPy
and then convert the result back to a sparse matrix if necessary.
The example keeps the scores above a threshold; reverse the
comparison to select the low scores instead.
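import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
threshold = 0.5
df = pd.DataFrame({'text': corpus})
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Densify once and reuse both the array and the boolean mask
dense = X.toarray()
mask = dense > threshold
rows, columns = np.where(mask)
# tolist() yields plain Python ints, so the printout is the same across NumPy versions
print(f'Rows: {rows.tolist()}\nColumns: {columns.tolist()}')

# Rebuild a sparse matrix that holds only the entries above the threshold
filtered_sparse_matrix = csr_matrix((dense[mask], (rows, columns)), shape=X.shape)
print(f'Final Matrix:\n{filtered_sparse_matrix}')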
output:
Rows: [0, 1, 1, 2, 2, 2, 3]
Columns: [2, 1, 5, 0, 4, 7, 2]
Final Matrix:
(0, 2) 0.5802858236844359
(1, 1) 0.6876235979836938
(1, 5) 0.5386476208856763
(2, 0) 0.511848512707169
(2, 4) 0.511848512707169
(2, 7) 0.511848512707169
(3, 2) 0.5802858236844359
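For the low-score case asked about in the question, a dense comparison such as `X.toarray() < threshold` would also match every zero entry, that is, every word that does not occur in a document. One possible way around this is to stay in sparse form and compare only the stored values. The following is a minimal sketch under some assumptions, not a definitive implementation: it reuses the toy corpus from above, the cut-off of 0.4 is chosen purely for illustration (no score in this small corpus falls below 0.1), and the names `index_to_word` and `low_scores` are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
threshold = 0.4  # illustrative cut-off, not taken from the question

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# COO format exposes parallel arrays: row indices, column indices, stored values
coo = X.tocoo()
# Only stored (non-zero) scores are compared, so absent words are never matched
mask = coo.data < threshold

# vocabulary_ maps term -> column index; invert it to look words up by column
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# One (document index, word, score) triple per low-scoring entry
low_scores = [
    (int(r), index_to_word[int(c)], float(v))
    for r, c, v in zip(coo.row[mask], coo.col[mask], coo.data[mask])
]

for doc_idx, word, score in low_scores:
    print(doc_idx, word, score)
```

Each triple keeps the document index next to the word, so a word that scores low in several documents appears once per document. For a large corpus this also avoids materializing the full dense array.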