Scikit learn - How to use SVM and Random Forest for text classification?
I have a set of trainFeatures and a set of testFeatures with positive, neutral and negative labels:
trainFeats = negFeats + posFeats + neutralFeats
testFeats = negFeats + posFeats + neutralFeats
For example, one entry inside trainFeats is
(['blue', 'yellow', 'green'], 'POSITIVE')
The same applies to the list of test features, so I specify the labels for each set. My question is: how can I use the scikit-learn implementations of the Random Forest classifier and SVM to get the accuracy of each classifier, together with precision and recall scores for each class? The problem is that I am currently using words as features, while from what I have read these classifiers require numbers. Is there a way I can achieve this without changing functionality? Many thanks!
You can look into this scikit-learn tutorial, especially the section on learning and predicting, for how to create and use a classifier. The example uses SVM, but it is simple to use RandomForestClassifier instead, as all classifiers implement the fit and predict methods.
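As a minimal sketch of that shared interface, the snippet below trains an SVM and a random forest through the same fit and predict calls. The numeric toy data and the hyperparameters (a linear kernel, 50 trees, a fixed random_state) are assumptions made for illustration only; they do not come from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Made-up numeric data: two well-separated clusters, two features per sample.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
X_test = [[0.5, 0.5], [5.5, 5.5]]

predictions = {}
for clf in (SVC(kernel="linear"), RandomForestClassifier(n_estimators=50, random_state=0)):
    clf.fit(X_train, y_train)  # same training call for both classifiers
    predictions[type(clf).__name__] = [str(label) for label in clf.predict(X_test)]

print(predictions)
```

Because only the constructor differs, swapping one classifier for the other does not require changing the rest of the code.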
When working with text features you can use CountVectorizer or DictVectorizer. Take a look at feature extraction, especially section 4.1.3.
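One possible way to turn the (tokens, label) entries from the question into numbers is sketched below. It assumes the words are already tokenized, so CountVectorizer is given a pass-through callable as its analyzer; the helper name identity and the sample entries other than the first are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Entries in the question's format; only the first one appears in the question.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["grey", "white", "blue"], "NEUTRAL"),
]

train_tokens = [tokens for tokens, label in train_feats]
train_labels = [label for tokens, label in train_feats]

def identity(tokens):
    """Pass pre-tokenized input through unchanged (hypothetical helper)."""
    return tokens

# A callable analyzer receives each raw sample and returns its features.
vectorizer = CountVectorizer(analyzer=identity)
X_train = vectorizer.fit_transform(train_tokens)  # sparse word-count matrix

print(X_train.shape)
print(sorted(vectorizer.vocabulary_))
```

The test set should be converted with vectorizer.transform rather than fit_transform, so that both matrices share the same columns; words not seen during fitting are ignored.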
You can find an example of classifying text documents here.
Then you can get the precision and recall of the classifier with the classification report.
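Putting the pieces together, the sketch below reports accuracy plus per-class precision and recall for both classifiers. The data, the identity and split helpers, and the hyperparameters are assumptions for illustration; the pipeline is just one convenient way to chain vectorizing and classifying, not necessarily how the original poster structured the code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical data in the question's (tokens, label) format.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["green", "bright"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["black", "dark"], "NEGATIVE"),
    (["grey", "white"], "NEUTRAL"),
    (["white", "plain"], "NEUTRAL"),
]
test_feats = [
    (["yellow", "bright"], "POSITIVE"),
    (["red", "dark"], "NEGATIVE"),
    (["grey", "plain"], "NEUTRAL"),
]

def split(feats):
    """Separate token lists from labels (hypothetical helper)."""
    return [tokens for tokens, _ in feats], [label for _, label in feats]

def identity(tokens):
    """Pass pre-tokenized input through unchanged (hypothetical helper)."""
    return tokens

X_train, y_train = split(train_feats)
X_test, y_test = split(test_feats)

results = {}
for clf in (SVC(kernel="linear"), RandomForestClassifier(n_estimators=100, random_state=0)):
    # The pipeline vectorizes the tokens, then trains or predicts with clf.
    model = make_pipeline(CountVectorizer(analyzer=identity), clf)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # zero_division=0 avoids warnings when a class is never predicted.
    report = classification_report(y_test, y_pred, zero_division=0)
    results[type(clf).__name__] = (accuracy, report)
    print(type(clf).__name__, "accuracy:", accuracy)
    print(report)
```

The printed report lists precision, recall and F1 score per label, so the word features themselves stay unchanged and only their numeric encoding is added.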