
How to get CountVectorizer ngram frequencies

I have a dataset of ~50k short texts, averaging 9 tokens each. They contain a large number of uncommon tokens ('nw', '29203822', 'x989', etc.) as well as regular words, and I believe these are degrading my classification results. I want to generate a stop-word list of the most frequent n-grams that offer no value and remove them. I figure the best place to get those counts is after my CountVectorizer but before my TF-IDF step.

count_vect = CountVectorizer(ngram_range=(1,4))
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(19859, 122567)

count_vect.vocabulary_

{'look': 66431,
'1466': 1827,
'cl sign': 23055,
'in': 56587,
...}
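Note that `vocabulary_` maps each n-gram to its *column index* in the count matrix, not to a frequency. A minimal sketch of looking up one term's total count this way (the two-document corpus here is illustrative, not from the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(1, 4))
X_counts = count_vect.fit_transform(["look up look", "cl sign in"])

# vocabulary_ values are column indices into X_counts, not counts
col = count_vect.vocabulary_['look']
total = X_counts[:, col].sum()  # total occurrences of 'look' across the corpus
```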

I don't see any function for outputting the frequency of these ngrams within the dataset. Is there one? Thanks!

There is no built-in function for that (as far as I know), but you can achieve it with the following function:

from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    # Fit a vectorizer over the requested n-gram range
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the total count of each n-gram across all documents
    sum_words = bag_of_words.sum(axis=0)
    # Pair every n-gram with its total count, most frequent first
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq
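A function like this can then feed the stop-word list the question asks about; a minimal usage sketch on a toy corpus (the corpus and the count threshold are illustrative assumptions, not from the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

def create_n_gram_frequency(n_gram_from, n_gram_to, corpus):
    vec = CountVectorizer(ngram_range=(n_gram_from, n_gram_to)).fit(corpus)
    sum_words = vec.transform(corpus).sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)

# Toy corpus (illustrative only)
corpus = ["the cat sat", "the cat ran", "the dog barked"]

freqs = create_n_gram_frequency(1, 2, corpus)
# freqs is a list of (ngram, total_count) pairs, most frequent first

# n-grams appearing more than twice become stop-word candidates
stop_words = [word for word, count in freqs if count > 2]
```

The resulting list can then be passed as the `stop_words` parameter of `TfidfVectorizer` (which accepts an explicit list of terms) to drop those n-grams before the TF-IDF step.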

