Creating a sparse word matrix (bag of words) in Python

Question

I have a list of text files in a directory.

I'd like to create a matrix with one row per file and one column per word in the corpus, where each entry is the frequency of that word in that file. (The corpus is every unique word in every file in the directory.)

Example:

File 1 - "aaa", "xyz", "cccc", "dddd", "aaa"  
File 2 - "abc", "aaa"
Corpus - "aaa", "abc", "cccc", "dddd", "xyz"

Output matrix:

[[2, 0, 1, 1, 1],
 [1, 1, 0, 0, 0]]

My solution is to use collections.Counter over every file, get a dictionary with the count of every word, and initialize a list of lists with size n × m (n = number of files, m = number of unique words in the corpus). Then I iterate over every file again, look up the frequency of every corpus word in its Counter, and fill each list with it.
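The two-pass approach described above can be sketched roughly as follows. This is only an illustration of the idea, not the asker's actual code: the helper name count_matrix is made up here, and the documents are assumed to be already tokenized into lists of words.

```python
from collections import Counter


def count_matrix(docs):
    """Build a dense document-term count matrix from tokenized documents."""
    # First pass: one Counter per document.
    counts = [Counter(doc) for doc in docs]
    # The corpus is the sorted set of all unique words across documents.
    corpus = sorted(set().union(*counts))
    # Second pass: one row per document, one column per corpus word.
    # A Counter returns 0 for words it has not seen.
    matrix = [[c[word] for word in corpus] for c in counts]
    return corpus, matrix


file_1 = ["aaa", "xyz", "cccc", "dddd", "aaa"]
file_2 = ["abc", "aaa"]
corpus, matrix = count_matrix([file_1, file_2])
print(corpus)  # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(matrix)  # [[2, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
```

The matrix here is a dense list of lists, so it stores every zero explicitly; that is the part a sparse representation would avoid.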

Is there a better way to solve this problem? Maybe in a single pass using collections.Counter?

Answer 1

Below is a fairly simple solution which uses sklearn.feature_extraction.DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
from collections import Counter, OrderedDict

File_1 = ('aaa', 'xyz', 'cccc', 'dddd', 'aaa')
File_2 = ('abc', 'aaa')

v = DictVectorizer()

# discover corpus and vectorize file word frequencies in a single pass
X = v.fit_transform(Counter(f) for f in (File_1, File_2))

# or, if you have a pre-defined corpus and/or would like to restrict the words you consider
# in your matrix, you can do

# Corpus = ('aaa', 'abc', 'cccc', 'dddd', 'xyz')
# v.fit([OrderedDict.fromkeys(Corpus, 1)])
# X = v.transform(Counter(f) for f in (File_1, File_2))

# X is a SciPy sparse matrix; toarray() returns a dense numpy.ndarray
# representation (the X.A shorthand does the same where it is available)
print(X)
print(X.toarray())

<2x5 sparse matrix of type '<type 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>
array([[ 2.,  0.,  1.,  1.,  1.],
       [ 1.,  1.,  0.,  0.,  0.]])
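The example above uses in-memory tuples, while the question starts from text files in a directory. A minimal sketch of that missing step is shown here, under some assumptions that are not in the original: the helper name count_words_per_file is hypothetical, the files are assumed to match *.txt and be UTF-8 encoded, and words are assumed to be separated by whitespace.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_words_per_file(directory, pattern="*.txt"):
    """Return (file names, Counters), with one Counter per matching file."""
    # Sort the paths so the row order is reproducible.
    paths = sorted(Path(directory).glob(pattern))
    # Assumption: words are whitespace-separated, with no further normalization.
    counters = [Counter(p.read_text(encoding="utf-8").split()) for p in paths]
    return [p.name for p in paths], counters


# Demo with two temporary files that mirror the example in the question.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "file_1.txt").write_text("aaa xyz cccc dddd aaa", encoding="utf-8")
    Path(tmp, "file_2.txt").write_text("abc aaa", encoding="utf-8")
    names, counters = count_words_per_file(tmp)

print(names)
print(counters)
```

The resulting list of Counter objects could then be passed to v.fit_transform in place of the generator over File_1 and File_2.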

The mapping from words to column indices can be accessed via v.vocabulary_.

{'aaa': 0, 'abc': 1, 'cccc': 2, 'dddd': 3, 'xyz': 4}
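A word-to-index mapping like this can be used to label the columns of the matrix. The sketch below uses a hand-written dictionary for the corpus in the question rather than a fitted vectorizer, so it runs without scikit-learn; the variable names are illustrative only.

```python
# Hypothetical mapping of the same shape as v.vocabulary_, written by hand
# for the corpus in the question (not read from a fitted vectorizer).
vocabulary = {"aaa": 0, "abc": 1, "cccc": 2, "dddd": 3, "xyz": 4}

# Column order: words sorted by their column index.
columns = sorted(vocabulary, key=vocabulary.get)

# First row of the dense matrix from the example (File 1).
row = [2, 0, 1, 1, 1]

# Pair every word with its count in that row.
labelled = {word: row[index] for word, index in vocabulary.items()}

print(columns)   # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(labelled)  # {'aaa': 2, 'abc': 0, 'cccc': 1, 'dddd': 1, 'xyz': 1}
```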

Answer 1 (score 5, accepted, 2017-10-27 00:13:11)
