Scikit-learn - How to use SVM and Random Forest for text classification?

Question

I have a set of trainFeatures and a set of testFeatures with positive, neutral and negative labels:

trainFeats = negFeats + posFeats + neutralFeats
testFeats  = negFeats + posFeats + neutralFeats

For example, one entry inside trainFeats is:

(['blue', 'yellow', 'green'], 'POSITIVE')

The same holds for the list of test features, so I specify the labels for each set. My question is: how can I use the scikit-learn implementations of the Random Forest classifier and SVM to get the accuracy of each classifier, together with precision and recall scores for each class? The problem is that I am currently using words as features, while from what I have read these classifiers require numbers. Is there a way to achieve this without changing the functionality? Many thanks!

Answer 1

You can look into this scikit-learn tutorial, especially the section on learning and predicting, for how to create and use a classifier. The example uses an SVM, but it is simple to use RandomForestClassifier instead, as all classifiers implement the fit and predict methods.
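To illustrate the shared fit/predict interface, here is a minimal sketch. The toy numeric data and the choice of LinearSVC as the SVM variant are assumptions made for illustration, not part of the tutorial mentioned above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy data: each row is an already-numeric feature vector.
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_train = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
X_test = [[0, 1], [1, 0]]

predictions = {}
# Both estimators expose the same fit/predict methods, so they are
# interchangeable in this loop.
for clf in (LinearSVC(), RandomForestClassifier(n_estimators=50, random_state=0)):
    clf.fit(X_train, y_train)
    predictions[type(clf).__name__] = clf.predict(X_test)

for name, y_pred in predictions.items():
    print(name, list(y_pred))
```

Swapping in another scikit-learn classifier should only require changing the tuple in the loop.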

When working with text features you can use CountVectorizer or DictVectorizer. Take a look at feature extraction, especially section 4.1.3.
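As a rough sketch of how the (word list, label) entries from the question could be turned into numbers, the snippet below counts the words of each entry and passes the counts to DictVectorizer. The sample entries and the helper name to_dicts_and_labels are hypothetical; only the entry format follows the question.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer

# Hypothetical entries in the same (words, label) format as the question.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["grey", "blue"], "NEUTRAL"),
]
test_feats = [(["yellow", "purple"], "POSITIVE")]


def to_dicts_and_labels(feats):
    """Split entries into word-count dicts and a parallel list of labels."""
    dicts = [Counter(words) for words, _ in feats]
    labels = [label for _, label in feats]
    return dicts, labels


train_dicts, y_train = to_dicts_and_labels(train_feats)
test_dicts, y_test = to_dicts_and_labels(test_feats)

vectorizer = DictVectorizer()
# Learn the vocabulary on the training set only, then reuse it for the
# test set so both matrices share the same columns.
X_train = vectorizer.fit_transform(train_dicts)
X_test = vectorizer.transform(test_dicts)

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Words that appear only in the test set ("purple" here) have no column and are ignored by transform. CountVectorizer is the usual choice when the input is raw text rather than pre-tokenized word lists.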

You can find an example for classifying text documents here.

Then you can get the precision and recall of the classifier for each class with the classification report.
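Putting the pieces together, a sketch of the whole workflow might look like the following. The data is invented, LinearSVC stands in for "SVM", and the pipeline layout is one possible arrangement rather than the only way to do it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical data in the (words, label) format used in the question.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["green", "bright"], "POSITIVE"),
    (["red", "black"], "NEGATIVE"),
    (["black", "dark"], "NEGATIVE"),
    (["grey", "plain"], "NEUTRAL"),
    (["grey", "beige"], "NEUTRAL"),
]
test_feats = [
    (["yellow", "bright"], "POSITIVE"),
    (["dark", "red"], "NEGATIVE"),
    (["plain", "beige"], "NEUTRAL"),
]


def split(feats):
    """Return word-presence dicts and the matching labels."""
    dicts = [dict.fromkeys(words, 1) for words, _ in feats]
    labels = [label for _, label in feats]
    return dicts, labels


X_train, y_train = split(train_feats)
X_test, y_test = split(test_feats)

reports = {}
for clf in (LinearSVC(), RandomForestClassifier(n_estimators=100, random_state=0)):
    # The pipeline vectorizes the dicts and then trains the classifier.
    model = make_pipeline(DictVectorizer(), clf)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    name = type(clf).__name__
    reports[name] = (
        accuracy_score(y_test, y_pred),
        # Per-class precision, recall and F1 as a formatted string.
        classification_report(y_test, y_pred, zero_division=0),
    )

for name, (accuracy, report) in reports.items():
    print(name, "accuracy:", accuracy)
    print(report)
```

The zero_division=0 argument only silences the warning raised when a class is never predicted; it requires a reasonably recent scikit-learn release.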

Answer 1: score 10, accepted, 2014-02-23 23:23:44
