How do I get a ranked list of n-grams with frequencies using scikit-learn and pandas?

Question

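I am trying to do this simple task with scikit-learn, but I am having trouble working with sparse matrices. For this purpose, I don't care about document frequency.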

Here is what I have so far:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3))
n_grams = vectorizer.fit_transform(df.column_with_text)

At this point I know I should do something involving n_grams and inverse_transform, but I'm not sure what. I want a list of [n_gram, frequency] pairs ranked by frequency, like this:

"apple banana", 100
"this is fun", 100
"cool pandas", 99
...

Thanks.

Answer 1

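You can extract the vocabulary from the vectorizer with vocabulary_. The keys are the n-grams, and each value is the index of the column in the vectorized output that corresponds to that key: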

vectorizer.vocabulary_
{'apple': 0,
 'apple banana': 1,
 'apple banana this': 2,
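
To make the key-to-column relationship concrete, here is a minimal sketch. The two-document corpus and the names docs, col and per_doc are made up for illustration and are not from the question's data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the mapping
docs = ["apple banana this", "apple banana"]

vectorizer = CountVectorizer(ngram_range=(1, 3))
n_grams = vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram to the index of its column in n_grams
col = vectorizer.vocabulary_["apple banana"]

# Slicing that column gives the per-document counts of "apple banana"
per_doc = n_grams[:, col].toarray().ravel()
print(col, per_doc)
```

Only a single column is converted to a dense array here, so the full matrix stays sparse.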

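The frequencies are the sums of the columns of n_grams. To calculate them, it is probably easiest to first convert the sparse matrix to a numpy array with toarray(). One way to then match the sums up with the n-grams is: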

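# Map each n-gram to its column index
vocab = vectorizer.vocabulary_
# Total count of each n-gram across all documents (column sums)
count_values = n_grams.toarray().sum(axis=0)
# Pair each count with its n-gram and sort by count, highest first
counts = sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)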

counts
[(4, 'pandas'),
 (4, 'cool pandas'),
 (4, 'cool'),
 (2, 'this is fun'),
 (2, 'this is'),

Solution 1
2 2016-03-17 00:41:57
