
Vectorizer fit_transform count by document

I have a scipy.sparse.csr.csr_matrix with shape (2500, 2003) and I'm trying to get the top 5 words from each document. How would you transform something that looks like this

(0, 1)  1

(0, 2)  1

(0, 6)  1

(1, 5)  2

(1, 1)  1

(2, 4)  1

(2, 7)  1

(3, 1)  1

etc. into a count of the words by document? Thanks!

You need to find the 5 largest values (and their sorted indices) in each row of a sparse matrix. Please see these similar questions: Scipy.sparse.csr_matrix: How to get top ten values and indices?

Finding the top n values in a row of a scipy sparse matrix

You can convert your sparse matrix to dense and use something like np.argsort(arr, axis=1), but that would be very inefficient for large matrices.
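For small matrices, though, the dense route is the simplest thing that works. A minimal sketch (the helper name top_k_dense is mine, not from the original answer):

```python
import numpy as np
from scipy import sparse


def top_k_dense(mat: sparse.csr_matrix, n: int = 5):
    # Densify, then sort each row's counts in descending order.
    # kind="stable" keeps ties in column order, so results are deterministic.
    arr = mat.toarray()
    order = np.argsort(-arr, axis=1, kind="stable")[:, :n]  # column indices of the n largest counts
    values = np.take_along_axis(arr, order, axis=1)
    return order, values


mat = sparse.csr_matrix(np.array([[0, 1, 1, 0, 0, 0, 1, 0],
                                  [0, 1, 0, 0, 0, 2, 0, 0]]))
idx, vals = top_k_dense(mat, n=2)
print(idx)   # [[1 2] [5 1]] -- per-row indices of the two largest counts
print(vals)  # [[1 1] [2 1]]
```

This materializes the full 2500 x 2003 array in memory and sorts every row completely, which is why it does not scale to large vocabularies.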

So the more efficient way is to implement it yourself:

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer


def top_k_sparse(mat: sparse.csr_matrix, n: int = 5):
    top_indices = []
    top_values = []
    for left, right in zip(mat.indptr[:-1], mat.indptr[1:]):
        n_row_pick = min(n, right - left)
        if n_row_pick == 0:
            # Empty row: argpartition would raise on an empty slice
            top_indices.append(np.array([], dtype=mat.indices.dtype))
            top_values.append(np.array([], dtype=mat.data.dtype))
            continue
        # Use argpartition first, as it is faster than a full sort on large arrays
        row_top = np.argpartition(mat.data[left:right], -n_row_pick)[-n_row_pick:]
        # Reorder the picked indices in descending order of count
        row_top = row_top[np.argsort(-mat.data[left:right][row_top])]

        top_indices.append(mat.indices[left + row_top])
        top_values.append(mat.data[left + row_top])

    return top_indices, top_values


documents = [
    "qwe asd zxc a5 a5 a5 a5 a5 b3 b3 b3 c4 c4 c4 c4 fgf rtrt e2 e2 d2 d2",
    "a5 a5 a5 a5 a5 b3 b3 b3 c4 c4 c4 c4 fgf rtrt e2 e2 qwe asd zxc d2 zxc",
]
vectorizer = CountVectorizer()
sparse_mat = vectorizer.fit_transform(documents)
top_indices, top_values = top_k_sparse(sparse_mat, 5)

inverse_vocabulary = {v: k for k, v in vectorizer.vocabulary_.items()}
for row_indices, row_counts in zip(top_indices, top_values):
    words = [inverse_vocabulary[idx] for idx in row_indices]
    print(words, row_counts)

I used a slightly modified version of the implementation by Louis Yang from here.

Also, if you just need the top words, you may consider using collections.Counter:

from collections import Counter

top_words = [Counter(doc.split(" ")).most_common(5) for doc in documents]
print(top_words)
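Note that splitting on spaces does not replicate CountVectorizer's preprocessing (lowercasing, and the default token pattern that drops single-character tokens), so the counts can diverge from the sparse-matrix ones. If they should match, one option is to reuse the vectorizer's own analyzer via its build_analyzer() method (a sketch; the variable names are mine):

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "qwe asd zxc a5 a5 a5 a5 a5 b3 b3 b3 c4 c4 c4 c4 fgf rtrt e2 e2 d2 d2",
    "a5 a5 a5 a5 a5 b3 b3 b3 c4 c4 c4 c4 fgf rtrt e2 e2 qwe asd zxc d2 zxc",
]
vectorizer = CountVectorizer()
vectorizer.fit(documents)

# build_analyzer() returns the same preprocessing + tokenization
# callable the vectorizer applies internally during fit_transform
analyzer = vectorizer.build_analyzer()
top_words = [Counter(analyzer(doc)).most_common(5) for doc in documents]
print(top_words)
```

This counts exactly the tokens that end up in the sparse matrix, so the top-5 lists agree with top_k_sparse up to tie-breaking order.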
