Difference between scikit-learn and MLlib predictions in Python

Question

I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I followed the same logic with scikit-learn and MLlib, using exactly the same dataset. For scikit-learn I have the following code:

from sklearn.svm import SVC

# X_train, y_train: training features and labels (not shown in the question)
svc_model = SVC()
svc_model.fit(X_train, y_train)

# predict expects a 2D array, one row per sample
print("supposed to be 1")
print(svc_model.predict([[15, 15, 0, 15, 15, 4, 12, 8, 0, 7]]))
print(svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]]))

print("supposed to be 0")
print(svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]]))
print(svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]]))
print(svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]]))
print(svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]]))

and it returns:

supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]

For Spark I am doing:

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

# trainingData: an RDD of LabeledPoint (not shown in the question)
model_svm = SVMWithSGD.train(trainingData, iterations=100)

print("supposed to be 1")
print(model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0)))

print("supposed to be 0")
print(model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0)))
print(model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0)))
print(model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0)))

which returns:

supposed to be 1
1
1
1
1
supposed to be 0
1
1
1
1

I have tried to keep my positive and negative classes balanced. My test data contains 3521 records and my training data 8356 records. For the evaluation, cross-validation on the scikit-learn model gives 98% accuracy, while for Spark the area under ROC is 0.5, the area under PR is 0.74, and the training error is 0.47.

I have also tried clearing the threshold and setting it back to 0.5, but this did not give any better results. Sometimes, when I change the train-test split, I get, for example, all zeros except for one correct prediction, or all ones except for one correct zero prediction. Does anyone know how to approach this problem?

As I said, I have checked multiple times that my dataset is exactly the same in both cases.

Answer 1

You're using different classifiers, so you are getting different results. Sklearn's SVC is an SVM with an RBF kernel; SVMWithSGD is an SVM with a linear kernel trained using SGD. They are totally different.

If you want to match the results, then I think the way to go is to use sklearn.linear_model.SGDClassifier(loss='hinge') and try to match the other parameters (regularization, whether to fit an intercept, etc.), because the defaults are not the same.
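A minimal sketch of that suggestion, assuming synthetic data from make_classification as a stand-in for the asker's dataset. The parameter values try to mirror what I understand to be the SVMWithSGD.train defaults (regParam=0.01, regType="l2", intercept=False); the mapping of regParam to alpha is an approximation, not an exact equivalence, so identical predictions should not be expected.

```python
# Sketch only: a linear SVM trained with SGD in scikit-learn, configured to
# resemble MLlib's SVMWithSGD.train defaults. The data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for the asker's 10-feature dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

linear_svm = SGDClassifier(
    loss="hinge",         # hinge loss gives a linear SVM
    penalty="l2",         # assumed to correspond to regType="l2"
    alpha=0.01,           # roughly plays the role of regParam=0.01
    fit_intercept=False,  # assumed to correspond to intercept=False
    max_iter=100,         # upper bound on passes over the data
    random_state=0,
)
linear_svm.fit(X, y)

predictions = linear_svm.predict(X[:4])  # array of class labels (0 or 1)
print(predictions)
print(linear_svm.score(X, y))            # training accuracy
```

Comparing this model (rather than SVC with its RBF kernel) against the Spark model is the like-for-like comparison the answer has in mind.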

Answer 2

Your call to clearThreshold is causing the classifier to return the raw prediction scores:

clearThreshold()
Note: Experimental.
Clears the threshold so that predict will output raw prediction scores. It is used for binary classification only.

New in version 1.4.0.

If you want just the predicted class, remove this function call.
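To illustrate the difference between a raw score and a predicted class, here is a small pure-Python sketch. ToyLinearModel and its method names are hypothetical and only imitate the behaviour described in the quoted documentation; this is not MLlib code, and the weights are made up.

```python
# Sketch only: a toy stand-in that mimics how a thresholded linear
# classifier behaves. ToyLinearModel is hypothetical, not an MLlib class.
class ToyLinearModel(object):
    def __init__(self, weights, intercept=0.0, threshold=0.0):
        self.weights = weights
        self.intercept = intercept
        self.threshold = threshold

    def clear_threshold(self):
        # After this, predict returns the raw margin, not a class label.
        self.threshold = None

    def set_threshold(self, value):
        self.threshold = value

    def predict(self, features):
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        if self.threshold is None:
            return margin
        return 1 if margin > self.threshold else 0


model = ToyLinearModel(weights=[0.5, -0.25])
point = [2.0, 1.0]                # margin = 0.5*2.0 - 0.25*1.0 = 0.75

label = model.predict(point)      # thresholded: 0.75 > 0.0, so class 1
model.clear_threshold()
raw_score = model.predict(point)  # raw margin, 0.75
model.set_threshold(1.0)
stricter = model.predict(point)   # 0.75 is not > 1.0, so class 0
print(label, raw_score, stricter)
```

The same input yields a class label or a real-valued score depending only on whether a threshold is set, which is why evaluation code has to know which of the two it is receiving.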


Answer 1: 3 (accepted), 2016-12-21 14:05:15

Answer 2: 1, 2016-12-21 02:06:24
