
TfidfVectorizer remove features with zero tf-idf score

I want to cluster documents using Python. First, I generate the document x term matrix with tf-idf scores as follows:

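from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize_and_stem and descriptions are defined elsewhere (not shown here)
tfidf_vectorizer_desc = TfidfVectorizer(min_df=1, max_df=0.9, use_idf=True, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions)  # fit the vectorizer to the text (%time is an IPython magic)
desc_feature_names = tfidf_vectorizer_desc.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()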

The matrix shape is (1510, 6862).

Scores for each term in the first document:

dense = tfidf_matrix_desc.todense()
print(len(dense[0].tolist()[0]))
dataset0 = dense[0].tolist()[0] 
phrase_scores = [pair for pair in zip(range(0, len(dataset0)), dataset0) if pair[1] > 0]
print(len(phrase_scores))

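Output:

  • print(len(dense[0].tolist()[0])) -> 6862
  • print(len(phrase_scores)) -> 48 (the first document contains only 48 terms with a score greater than 0.0)

Now I want to identify all the features (terms) in the matrix that have a tf-idf score of 0 for the given dataset. How can I achieve this?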

# Print each nonzero term of the first document together with its tf-idf score
for col in tfidf_matrix_desc[0].nonzero()[1]:
    print(desc_feature_names[col], ' - ', tfidf_matrix_desc[0, col])
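The loop above lists the nonzero terms; the question asks for the opposite, the terms whose score is 0. A minimal sketch of one way to get them, assuming the scores are non-negative (so a column sum of 0 means every entry is 0). The helper name `zero_score_features` and the toy matrix are made up for illustration and stand in for `tfidf_matrix_desc` and `desc_feature_names`:

```python
import numpy as np
from scipy import sparse

def zero_score_features(matrix, feature_names, rows=None):
    """Return the names of features whose score is 0 in every selected row.

    Hypothetical helper, not part of scikit-learn. rows=None means all rows.
    """
    sub = sparse.csr_matrix(matrix)
    if rows is not None:
        sub = sub[rows]  # restrict to a subset of documents
    # tf-idf scores are non-negative, so a column sums to 0 only if all its entries are 0
    col_sums = np.asarray(sub.sum(axis=0)).ravel()
    return [name for name, total in zip(feature_names, col_sums) if total == 0]

# Toy stand-in for the real matrix: 3 documents x 4 terms
toy_matrix = sparse.csr_matrix(np.array([
    [0.5, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.7, 0.0],
    [0.3, 0.0, 0.0, 0.0],
]))
toy_names = ["apple", "banana", "cherry", "date"]

zero_all = zero_score_features(toy_matrix, toy_names)              # zero in every document
zero_first = zero_score_features(toy_matrix, toy_names, rows=[0])  # zero in the first document only
print(zero_all, zero_first)
```

Note that on the same data the vectorizer was fitted on, every vocabulary term occurs in at least one document, so I would normally expect the all-documents list to be empty there; zero columns are more likely for a subset of rows, after thresholding, or when transforming a different set of documents.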

In case someone needs something similar, I am using the following:

import numpy as np

# Xtr is the output sparse matrix from TfidfVectorizer
# min_tfidf is a threshold for defining the "new" 0
def remove_zero_tf_idf(Xtr, min_tfidf=0.04):
    D = Xtr.toarray() # convert to dense if you want
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0) # column means; a mean of 0 marks a feature that is 0 in all documents
    D = np.delete(D, np.where(tfidf_means == 0)[0], axis=1) # delete them from the matrix
    return D
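For a large matrix (for example 1510 x 6862), converting to dense may be wasteful. Below is a sketch of the same idea that stays sparse, assuming the input is a SciPy sparse matrix or something `csr_matrix` can convert. The function name `remove_zero_tf_idf_sparse` is made up here; it also returns the kept column indices so the feature names can be filtered the same way:

```python
import numpy as np
from scipy import sparse

def remove_zero_tf_idf_sparse(Xtr, min_tfidf=0.04):
    """Sparse variant (illustrative): threshold small scores, then drop all-zero columns."""
    X = sparse.csr_matrix(Xtr, copy=True)        # work on a copy, leave the input untouched
    X.data[X.data < min_tfidf] = 0               # scores below the threshold become the "new" 0
    X.eliminate_zeros()                          # remove the stored entries that are now 0
    keep = np.flatnonzero(X.getnnz(axis=0) > 0)  # columns that still have at least one entry
    return X[:, keep], keep

# Toy stand-in for the TfidfVectorizer output: 3 documents x 3 terms
toy = sparse.csr_matrix(np.array([
    [0.5, 0.01, 0.0],
    [0.0, 0.02, 0.0],
    [0.3, 0.0,  0.9],
]))
reduced, keep = remove_zero_tf_idf_sparse(toy)
print(reduced.shape, keep)
```

With real data, the matching feature names can then be selected by indexing an array of the names with `keep`.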

