

Remove single occurrences of words in vocabulary TF-IDF

I am attempting to remove words that occur only once in my vocabulary, in order to reduce the vocabulary size. I am using scikit-learn's TfidfVectorizer() and then calling fit_transform on my data frame.

tfidf = TfidfVectorizer()  
tfs = tfidf.fit_transform(df['original_post'].values.astype('U')) 

My first thought is to use the preprocessor argument of the TfidfVectorizer, or to use the preprocessing package before the machine-learning step.

Any tips or links to further implementation?

You are looking for the min_df parameter (minimum document frequency). From the documentation of scikit-learn's TfidfVectorizer:

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

# ignore words that appear in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
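As the quoted documentation notes, min_df accepts either an absolute document count or a proportion of documents. The two calls below are only illustrative values, not a recommendation:

# absolute count: ignore words that appear in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)

# proportion: ignore words that appear in less than 1% of the documents
tfidf = TfidfVectorizer(min_df=0.01)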

You can also remove common words:

# ignore words that appear in more than half of the documents
tfidf = TfidfVectorizer(max_df=0.5)

You can also remove stop words like this:

tfidf = TfidfVectorizer(stop_words='english')
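These options can be combined in a single vectorizer. A sketch using the DataFrame column from the question (the exact thresholds are assumptions, tune them to your corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

# drop rare words, very common words and English stop words in one pass
tfidf = TfidfVectorizer(min_df=5, max_df=0.5, stop_words='english')
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))

# size of the resulting vocabulary
print(len(tfidf.vocabulary_))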

ShmulikA's answer will most likely work well, but it removes words based on document frequency. Thus, if a specific word occurs 200 times in only one document, it will still be removed. The TF-IDF vectorizer does not provide exactly what you want out of the box. You would have to:

  1. Fit the vectorizer to your corpus and extract the complete vocabulary from it.
  2. Use those words as keys in a new dictionary.
  3. Count every occurrence of each word:

for document in corpus:
    for word in document.split():
        vocabulary[word] += 1

Now look for entries whose value is 1 and drop them from the dictionary. Put the remaining keys into a list and pass that list as the vocabulary parameter to the TF-IDF vectorizer.
This needs a lot of looping, so it may be simpler to just use min_df, which works well in practice.
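A minimal sketch of those steps, reusing the df['original_post'] column from the question and scikit-learn's own analyzer for tokenisation (the variable names and the occurrence threshold of 1 are just illustrative):

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = df['original_post'].values.astype('U')

# use the same tokenisation the vectorizer will apply later
analyze = TfidfVectorizer().build_analyzer()

# count total occurrences of every word across the whole corpus
vocabulary = Counter()
for document in corpus:
    vocabulary.update(analyze(document))

# keep only words that occur more than once
keep = [word for word, count in vocabulary.items() if count > 1]

# pass the pruned vocabulary to the TF-IDF vectorizer
tfidf = TfidfVectorizer(vocabulary=keep)
tfs = tfidf.fit_transform(corpus)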
