I want to cluster documents using Python. First I generate a document-term matrix with tf-idf scores as below:
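from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer_desc = TfidfVectorizer(min_df=1, max_df=0.9, use_idf=True, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions)  # fit the vectorizer to the text
# get_feature_names() was removed in scikit-learn 1.2; on older versions keep the original call
desc_feature_names = tfidf_vectorizer_desc.get_feature_names_out()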
The matrix shape is (1510, 6862)
The score of each term in the first document:
dense = tfidf_matrix_desc.todense()
print(len(dense[0].tolist()[0]))
dataset0 = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(dataset0)), dataset0) if pair[1] > 0]
print(len(phrase_scores))
Output :
Now I want to identify all features (terms) that have a tf-idf score of 0 for a given document in the matrix. How can I achieve this?
# Print the non-zero tf-idf terms of the first document (row 0)
for col in tfidf_matrix_desc[0].nonzero()[1]:
    print(desc_feature_names[col], ' - ', tfidf_matrix_desc[0, col])
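The loop above walks the non-zero entries, while the question asks for the terms whose score is 0. A minimal sketch of one way to get those, assuming the matrix is a SciPy sparse matrix (which is what `TfidfVectorizer.fit_transform` returns). The helper name `zero_score_terms` and the toy data are hypothetical and only stand in for `tfidf_matrix_desc` and `desc_feature_names`:

```python
import numpy as np
from scipy.sparse import csr_matrix

def zero_score_terms(matrix, feature_names, doc_index=0):
    """Return the terms whose tf-idf score is 0 for one document (row)."""
    row = csr_matrix(matrix)[doc_index]          # 1 x n_features sparse row
    is_zero = np.ones(row.shape[1], dtype=bool)  # start with every column marked as zero
    is_zero[row.nonzero()[1]] = False            # unmark columns holding a non-zero score
    return [feature_names[i] for i in np.flatnonzero(is_zero)]

# Toy data (hypothetical), standing in for the real matrix and feature names
toy_matrix = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.8],
                                  [0.3, 0.0, 0.0, 0.9]]))
toy_terms = ["apple", "banana", "cherry", "date"]
print(zero_score_terms(toy_matrix, toy_terms, doc_index=0))  # ['apple', 'cherry']
```

With the real data this would presumably be called as `zero_score_terms(tfidf_matrix_desc, desc_feature_names, 0)`. Working on a single sparse row avoids converting the whole (1510, 6862) matrix to dense.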
Just in case anyone would need something similar, what I use is the following:
import numpy as np

# Xtr is the output sparse matrix from TfidfVectorizer
# min_tfidf is a threshold for defining the "new" 0
def remove_zero_tf_idf(Xtr, min_tfidf=0.04):
    D = Xtr.toarray()  # convert to dense if you want
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)  # a column mean of 0 means the feature is 0 in all documents
    D = np.delete(D, np.where(tfidf_means == 0)[0], axis=1)  # delete them from the matrix
    return D
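If the matrix is too large to convert to dense, the same idea can probably be applied directly to the sparse data. This is only a sketch under that assumption, not part of the answer above. The name `remove_zero_tf_idf_sparse` is hypothetical, and it additionally returns the indices of the kept columns so the feature names can be filtered to match:

```python
import numpy as np
from scipy.sparse import csr_matrix

def remove_zero_tf_idf_sparse(Xtr, min_tfidf=0.04):
    """Zero out scores below min_tfidf, then drop columns that are 0 in all documents."""
    X = csr_matrix(Xtr, copy=True)   # work on a copy so the input stays untouched
    X.data[X.data < min_tfidf] = 0   # apply the threshold to the stored values only
    X.eliminate_zeros()              # remove the entries that were just set to 0
    keep = np.flatnonzero(X.getnnz(axis=0) > 0)  # columns with at least one remaining score
    return X[:, keep], keep

# Toy data (hypothetical): column 0 is below the threshold everywhere, column 2 is all zeros
toy = csr_matrix(np.array([[0.01, 0.5, 0.0],
                           [0.02, 0.0, 0.0],
                           [0.03, 0.7, 0.0]]))
reduced, kept = remove_zero_tf_idf_sparse(toy)
print(reduced.shape, kept)
```

The matching feature names can then be selected with something like `[desc_feature_names[i] for i in kept]`.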