How to use SVM in scikit-learn to classify different test data for review spam detection

I am doing review spam detection using an SVM in scikit-learn. For this task I am using a gold-standard dataset of 400 truthful and 400 deceptive reviews. So far I have split this same dataset into training and test sets and measured the accuracy.

Now I want to train my SVM classifier on this dataset and then classify newly downloaded test data that is different from the original dataset.

How can I do this? My code so far is:

import sklearn.datasets
import sklearn.metrics
import sklearn.svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split


def main():
    # raw string so the backslashes in the Windows path are kept literally
    dir_path = r'C:\spam\hotel-reviews'
    # load_files expects one sub-folder per class under dir_path
    files = sklearn.datasets.load_files(dir_path)

    # bag-of-words counts
    model = CountVectorizer()
    word_counts = model.fit_transform(files.data)

    # re-weight the counts with TF-IDF
    tf_transformer = TfidfTransformer(use_idf=True).fit(word_counts)
    X = tf_transformer.transform(word_counts)
    # print(X)
    print('\n\n')

    # create classifier
    clf = sklearn.svm.LinearSVC()
    # test the classifier
    test_classifier(X, files.target, clf, test_size=0.2, y_names=files.target_names, confusion=False)


def test_classifier(X, y, clf, test_size=0.3, y_names=None, confusion=False):
    # train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

    clf.fit(X_train, y_train)
    y_predicted = clf.predict(X_test)

    print(sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names))


if __name__ == '__main__':
    main()

Now I want to classify my own, different review data (500 reviews in a reviews.txt file) using the classifier trained above. How can I do this?

Scoring your data takes two steps: load the new reviews in the same format as the training data (without the target variable), then call predict on the trained classifier. Either return clf and use a separate method for scoring, or do the scoring inside the same method. This is the workflow:

def scoreData(clf):
    # loadScoringData is a signature only: it must return features in the
    # same format as the training data, without the target variable
    x_for_predict = loadScoringData("reviews.txt")
    y_predict = clf.predict(x_for_predict)
    plotResults(clf, y_predict)  # just a signature
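
One detail the workflow above leaves open is that the new reviews have to pass through the same fitted vectorizer and TF-IDF step as the training data; refitting them on reviews.txt would produce a different vocabulary and the classifier's weights would no longer line up. A minimal sketch of one way to do this is to wrap the feature extraction and the SVM in a single scikit-learn pipeline. The function names (train_review_classifier, load_reviews, classify_reviews) are hypothetical, and the assumption that reviews.txt holds one review per line is mine, not stated in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_review_classifier(texts, labels):
    """Fit TF-IDF features and a linear SVM on the labelled reviews."""
    # TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
    pipeline = make_pipeline(TfidfVectorizer(use_idf=True), LinearSVC())
    pipeline.fit(texts, labels)
    return pipeline


def load_reviews(path, encoding="utf-8"):
    """Read unlabelled reviews, ASSUMING one review per non-empty line."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


def classify_reviews(pipeline, reviews, target_names=None):
    """Label new reviews with the already-fitted pipeline (nothing is refitted)."""
    # predict() applies the stored vocabulary and IDF weights, then the SVM.
    predicted = pipeline.predict(reviews)
    if target_names is not None:
        # Map integer class ids back to folder names such as files.target_names.
        return [target_names[i] for i in predicted]
    return list(predicted)
```

With the files object from the question, this could be used as clf = train_review_classifier(files.data, files.target), followed by classify_reviews(clf, load_reviews("reviews.txt"), files.target_names). Training on the full labelled set here is a choice for the final model; the train/test split from the question is still the way to estimate accuracy beforehand.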
