How to get CountVectorizer ngram frequencies
I have a dataset of ~50k short texts, averaging 9 tokens each. They contain a large number of uncommon tokens ('nw', '29203822', 'x989', etc.) as well as regular words, and I believe these are degrading my classification results. I want to generate a stop word list of the most frequent n-grams that offer no value and remove them. I figure the best way is to get those counts after my CountVectorizer but before my TF-IDF.
count_vect = CountVectorizer(ngram_range=(1,4))
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
(19859, 122567)
count_vect.vocabulary_
{'look': 66431,
'1466': 1827,
'cl sign': 23055,
'in': 56587,
...}
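Note that the values in `vocabulary_` are column indices into the count matrix, not frequencies. A minimal sketch of reading one n-gram's total count from that matrix (using a small toy corpus, since the original `X_train` is not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for X_train, which is not shown in the question.
corpus = ["the cat sat", "the cat ran", "a dog sat"]

count_vect = CountVectorizer(ngram_range=(1, 4))
X_counts = count_vect.fit_transform(corpus)

# vocabulary_ maps each n-gram to its column index in X_counts.
col = count_vect.vocabulary_["the cat"]

# Summing that column gives the n-gram's total frequency across the corpus.
freq = X_counts[:, col].sum()
print(freq)  # "the cat" occurs once in each of the first two documents -> 2
```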
I don't see any function for outputting the frequency of these n-grams within the dataset. Is there one? Thanks!
There is no built-in function for that (as far as I know), but you can achieve it with the following function:
from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    # Fit a vectorizer over the requested n-gram range.
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the total count of each n-gram across the corpus.
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    # Sort by frequency, most common first.
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq
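From the sorted list you can then take the most frequent uninformative n-grams as your stop list before TF-IDF. A sketch with a toy corpus (the frequency threshold here is purely illustrative, not a recommendation):

```python
from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)

# Toy corpus with a recurring low-value token ('nw'), standing in for the real data.
corpus = ["nw report ready", "nw report sent", "meeting notes ready"]

freqs = create_n_gram_frequency(1, 2, corpus)

# Pick candidate stop terms from the head of the list (threshold is illustrative).
stop_terms = [word for word, count in freqs if count >= 2]
```

One caveat: `TfidfVectorizer`'s `stop_words` parameter filters individual tokens before n-grams are built, so multi-word entries in the list are ignored there; for n-gram stop lists it is safer to strip the terms from the text or drop the corresponding matrix columns instead.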