Find the words with specified tf-idf scores

Question

How can I get the words with the lowest tf-idf scores out of all the words?

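from sklearn.feature_extraction.text import TfidfVectorizer

# clean is a custom analyzer function and df a DataFrame, both defined elsewhere
tfidf_vect = TfidfVectorizer(analyzer=clean)
print(tfidf_vect.fit_transform(df['text']))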

[output]
(0, 11046)  0.1907144678156909
(0, 4791)   0.3125060892887963
(0, 7026)   0.15156899671911586
(0, 1534)   0.3125060892887963
...

I want to get, say, the words with a score below 0.1, together with their indices. I know that I am working with a csr_matrix, and I converted it to an array to make it easier to work with, but I couldn't get it to work.

Answer 1

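The easiest way that comes to mind is to filter with plain NumPy and then convert the result back to a sparse matrix if necessary. The example below keeps the scores above the threshold; to select low scores instead, flip the comparison and exclude the zeros, which stand for words that do not occur in a document.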

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
threshold = 0.5
df = pd.DataFrame({'text': corpus})
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Densify once and keep only the scores above the threshold
dense = X.toarray()
mask = dense > threshold
rows, columns = np.where(mask)
print(f'Rows: {list(rows)}\nColumns: {list(columns)}')

# Rebuild a sparse matrix from the surviving entries
filtered_sparse_matrix = csr_matrix((dense[mask], (rows, columns)), shape=X.shape)
print(f'Final Matrix:\n{filtered_sparse_matrix}')

output:

Rows: [0, 1, 1, 2, 2, 2, 3]
Columns: [2, 1, 5, 0, 4, 7, 2]
Final Matrix:
  (0, 2)    0.5802858236844359
  (1, 1)    0.6876235979836938
  (1, 5)    0.5386476208856763
  (2, 0)    0.511848512707169
  (2, 4)    0.511848512707169
  (2, 7)    0.511848512707169
  (3, 2)    0.5802858236844359
