How to use scikit-learn to classify text

I want to classify two texts using scikit-learn, but I want to extract the features myself. For example, CountVectorizer accepts stop_words='english' to use a built-in English stop word list. How can I give CountVectorizer my own word list to count?

You can pass your own list of stop words to the stop_words argument of CountVectorizer, and it will not count those words in your input text. For example, if I don't want words such as "cat", "dog" and "elephant" to be used as tokens, I would instantiate CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words=['cat', 'dog', 'elephant'])
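
Here is a minimal sketch of how that behaves on two made-up sentences (the `docs` list and the variable names are only for illustration). The second half uses the `vocabulary` parameter, which this answer does not otherwise cover. It is included on the assumption that the question may be asking for a list of words to count rather than a list of words to skip.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only.
docs = ["the cat chased the dog", "an elephant saw the bird"]

# Custom stop words: these tokens are dropped before counting.
stop_vec = CountVectorizer(stop_words=["cat", "dog", "elephant"])
stop_counts = stop_vec.fit_transform(docs)
stop_features = sorted(stop_vec.vocabulary_)  # tokens that were kept

# Fixed vocabulary (assumption about the question's intent):
# only these words are counted, in this column order.
vocab_vec = CountVectorizer(vocabulary=["cat", "dog", "bird"])
vocab_counts = vocab_vec.transform(docs).toarray()

print(stop_features)
print(vocab_counts)
```

With stop_words the listed words are excluded and everything else is counted. With vocabulary only the listed words are counted and everything else is ignored.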

Hope that helps.
