How to classify text using scikit-learn

Question

I want to classify two texts using scikit-learn, but I want to extract the features myself. For example, passing stop_words='english' to CountVectorizer makes it use a built-in English stop word list. How can I set my own word list for CountVectorizer to count?

Answer 1

You can pass your own list of stop words to the stop_words argument of CountVectorizer, and it will not count those words in your input text. For example, if I don't want words such as "cat", "dog" and "elephant" to be used as tokens, I would instantiate CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer(stop_words=['cat', 'dog', 'elephant'])
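To see the effect end to end, here is a minimal sketch. The two sample sentences and the variable names are made up for illustration; only CountVectorizer and its stop_words and vocabulary parameters come from scikit-learn. The second half assumes that the goal may instead be to count only a fixed word list, which the vocabulary parameter handles.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, invented for this illustration.
docs = [
    "the cat chased the dog",
    "an elephant met a dog near the river",
]

# Custom stop words: these tokens are dropped before counting.
custom_stop_words = ["cat", "dog", "elephant"]
vectorizer = CountVectorizer(stop_words=custom_stop_words)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept token to its column index.
kept_tokens = sorted(vectorizer.vocabulary_)
print(kept_tokens)
print(counts.toarray())

# Alternative (assumption about the asker's goal): count ONLY a fixed
# word list by passing it as `vocabulary`. Columns follow the list order.
fixed_vectorizer = CountVectorizer(vocabulary=["cat", "dog"])
fixed_counts = fixed_vectorizer.fit_transform(docs)
print(fixed_counts.toarray())
```

With stop_words the listed words are excluded and everything else is counted; with vocabulary only the listed words are counted. Note that the default tokenizer lowercases text and ignores single-character tokens such as "a".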

Hope that helps.

Solution 1
0 Accepted 2017-08-29 16:00:00
