
How do I get a ranked list of Ngrams with frequencies with scikit and pandas?

I am trying to do this simple task with scikit but I am having trouble working with the sparse matrix. For this, I don't care about document frequency.

This is what I have so far:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3))
n_grams = vectorizer.fit_transform(df.column_with_text)
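
For context, n_grams here is a SciPy sparse document-term matrix (one row per document, one column per distinct n-gram), which is exactly what I'm struggling to work with:

print(type(n_grams))   # e.g. <class 'scipy.sparse.csr.csr_matrix'>, depending on the SciPy version
print(n_grams.shape)   # (number of documents, number of distinct n-grams)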

At this point I know I am supposed to do something involving n_grams and inverse_transform, but I'm not sure what. I would like a list of [n_gram, frequency] ranked by frequency, like this:

"apple banana", 100
"this is fun", 100
"cool pandas", 99
...

Thanks.

You get the vocabulary out of your vectoriser with vocabulary_; the values are the column indices of the vectorized output corresponding to the keys:

vectorizer.vocabulary_
{'apple': 0,
 'apple banana': 1,
 'apple banana this': 2,
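
If you only need the n-gram strings in column order (rather than the {term: column} mapping), the vectorizer can also give you those directly; a minimal sketch, assuming scikit-learn >= 1.0 where get_feature_names_out() exists (older versions call it get_feature_names()):

# Array of n-grams in column order: entry i labels column i of n_grams.
feature_names = vectorizer.get_feature_names_out()
feature_names[:3]
# e.g. array(['apple', 'apple banana', 'apple banana this'], dtype=object)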

The frequencies will be the sums of the columns of n_grams. To calculate these, it's probably easiest to convert the sparse matrix to a numpy array first with toarray(); then one way to match them up with the vocabulary is a list comprehension:

vocab = vectorizer.vocabulary_
# Sum each column to get the total count of each n-gram across all documents.
count_values = n_grams.toarray().sum(axis=0)
# Pair each count with its n-gram and sort by count, descending.
counts = sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)

counts
[(4, 'pandas'),
 (4, 'cool pandas'),
 (4, 'cool'),
 (2, 'this is fun'),
 (2, 'this is'),
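
Since the question also mentions pandas, turning the same pairs into a frequency-ranked table is one more step. This is a sketch under the assumption that vocab and count_values are the objects built above (the name freq_df is just illustrative):

import pandas as pd

# Rank the n-grams by total count, highest first, in a two-column DataFrame.
freq_df = pd.DataFrame(
    sorted(((k, count_values[i]) for k, i in vocab.items()),
           key=lambda x: x[1], reverse=True),
    columns=['n_gram', 'frequency'])

If the corpus is large, you can also skip toarray() and sum the sparse matrix directly, e.g. np.asarray(n_grams.sum(axis=0)).ravel(), which avoids densifying the whole matrix.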
