How to get CountVectorizer ngram frequencies
I have a dataset of ~50k short texts, averaging 9 tokens each. They contain a large number of uncommon tokens ('nw', '29203822', 'x989', etc.) as well as regular words, and I believe these are degrading my classification results. I want to generate a stop word list of the most frequent n-grams that offer no value and remove them. I figure the best way is to get those counts after my CountVectorizer but before my TF-IDF.
count_vect = CountVectorizer(ngram_range=(1,4))
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
(19859, 122567)
count_vect.vocabulary_
{'look': 66431,
'1466': 1827,
'cl sign': 23055,
'in': 56587,
...}
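Note that the values in `vocabulary_` are column indices into the count matrix, not frequencies. A minimal sketch of reading one n-gram's total count from that matrix (using a small toy corpus, since the original `X_train` is not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for X_train, which is not shown in the question.
corpus = ["the cat sat", "the cat ran", "a dog sat"]

count_vect = CountVectorizer(ngram_range=(1, 4))
X_counts = count_vect.fit_transform(corpus)

# vocabulary_ maps each n-gram to its column index in X_counts.
col = count_vect.vocabulary_["the cat"]

# Summing that column gives the n-gram's total frequency across the corpus.
freq = X_counts[:, col].sum()
print(freq)  # "the cat" occurs once in each of the first two documents -> 2
```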
I don't see any function for outputting the frequency of these n-grams within the dataset. Is there one? Thanks!
There is no built-in function for that (as far as I know), but you can achieve it with the following function:
from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    # Fit a vectorizer over the requested n-gram range.
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the total count of each n-gram across the corpus.
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    # Sort by frequency, most common first.
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq
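From the sorted list you can then take the most frequent uninformative n-grams as your stop list before TF-IDF. A sketch with a toy corpus (the frequency threshold here is purely illustrative, not a recommendation):

```python
from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)

# Toy corpus with a recurring low-value token ('nw'), standing in for the real data.
corpus = ["nw report ready", "nw report sent", "meeting notes ready"]

freqs = create_n_gram_frequency(1, 2, corpus)

# Pick candidate stop terms from the head of the list (threshold is illustrative).
stop_terms = [word for word, count in freqs if count >= 2]
```

One caveat: `TfidfVectorizer`'s `stop_words` parameter filters individual tokens before n-grams are built, so multi-word entries in the list are ignored there; for n-gram stop lists it is safer to strip the terms from the text or drop the corresponding matrix columns instead.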