
How do I get a ranked list of Ngrams with frequencies with scikit and pandas?

I am trying to do this simple task with scikit but I am having trouble working with the sparse matrix. For this, I don't care about document frequency.

This is what I have so far:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3))
n_grams = vectorizer.fit_transform(df.column_with_text)
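
For context, n_grams here is a SciPy sparse document-term matrix (one row per document, one column per distinct n-gram), which is exactly what I'm struggling to work with:

print(type(n_grams))   # e.g. <class 'scipy.sparse.csr.csr_matrix'>, depending on the SciPy version
print(n_grams.shape)   # (number of documents, number of distinct n-grams)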

At this point I know I am supposed to do something involving n_grams and inverse_transform, but I'm not sure what. I would like a list of [n_gram, frequency] ranked by frequency, like this:

"apple banana", 100
"this is fun", 100
"cool pandas", 99
...

Thanks.

You get the vocabulary out of your vectoriser with vocabulary_; the values are the column indices of the vectorized output corresponding to the keys:

vectorizer.vocabulary_
{'apple': 0,
 'apple banana': 1,
 'apple banana this': 2,
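
If you only need the n-gram strings in column order (rather than the {term: column} mapping), the vectorizer can also give you those directly; a minimal sketch, assuming scikit-learn >= 1.0 where get_feature_names_out() exists (older versions call it get_feature_names()):

# Array of n-grams in column order: entry i labels column i of n_grams.
feature_names = vectorizer.get_feature_names_out()
feature_names[:3]
# e.g. array(['apple', 'apple banana', 'apple banana this'], dtype=object)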

The frequencies will be the sums of the columns of n_grams. To calculate these, it's probably easiest to convert the sparse matrix to a numpy array first with toarray(); then one way to match them up with the vocabulary is a list comprehension:

vocab = vectorizer.vocabulary_
# Sum each column to get the total count of each n-gram across all documents.
count_values = n_grams.toarray().sum(axis=0)
# Pair each count with its n-gram and sort by count, descending.
counts = sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)

counts
[(4, 'pandas'),
 (4, 'cool pandas'),
 (4, 'cool'),
 (2, 'this is fun'),
 (2, 'this is'),
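
Since the question also mentions pandas, turning the same pairs into a frequency-ranked table is one more step. This is a sketch under the assumption that vocab and count_values are the objects built above (the name freq_df is just illustrative):

import pandas as pd

# Rank the n-grams by total count, highest first, in a two-column DataFrame.
freq_df = pd.DataFrame(
    sorted(((k, count_values[i]) for k, i in vocab.items()),
           key=lambda x: x[1], reverse=True),
    columns=['n_gram', 'frequency'])

If the corpus is large, you can also skip toarray() and sum the sparse matrix directly, e.g. np.asarray(n_grams.sum(axis=0)).ravel(), which avoids densifying the whole matrix.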
