
grid search cross-validation on SVC probability output in sci-kit learn

I'd like to run a grid search cross-validation on the probability outputs of the SVC classifier. In particular, I'd like to minimize the negative log likelihood. From the documentation it seems that GridSearchCV calls the predict() method of the estimator it is passed, and SVC's predict() returns class predictions, not probabilities (predict_proba() returns class probabilities).

1) Do I need to subclass SVC and give it a predict() method that returns probabilities rather than classes to accomplish my log-likelihood cross-validation? I guess I then need to write my own score_func or loss_func?

2) Is cross-validating on this negative log likelihood dumb? I'm doing it because the dataset is: a) imbalanced 5:1, and b) not at all separable, i.e. even the "worst" observations have a >50% chance of being in the "good" class. (Will probably also post this 2nd question on stats Q&A.)

  1. Yes, you would, on both accounts.

     class ProbSVC(SVC):
         def predict(self, X):
             return super(ProbSVC, self).predict_proba(X)

  2. I'm not sure if this would work, since the majority class may still dominate the log-likelihood scores and the final estimator might still assign >0.5 probability to samples of the minority class. I'm not sure, though, so please post this to stats.
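To see what cross-validating on the negative log likelihood looks like in practice, here is a minimal sketch that rolls the metric by hand over stratified folds. The toy dataset, the 5:1 class weighting, and the neg_log_likelihood helper are all assumptions for illustration, not part of the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Toy imbalanced (~5:1) binary dataset standing in for the asker's data.
X, y = make_classification(n_samples=200, weights=[5 / 6, 1 / 6], random_state=0)

def neg_log_likelihood(model, X, y):
    """Mean negative log likelihood of the true labels under the model."""
    proba = model.predict_proba(X)
    # Probability the model assigns to each sample's true class;
    # columns of predict_proba follow model.classes_, which is [0, 1] here.
    p_true = proba[np.arange(len(y)), y]
    return -np.mean(np.log(np.clip(p_true, 1e-15, 1.0)))

scores = []
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    # probability=True enables predict_proba via Platt scaling.
    clf = SVC(C=1.0, probability=True, random_state=0).fit(X[train], y[train])
    scores.append(neg_log_likelihood(clf, X[test], y[test]))

print(np.mean(scores))
```

Wrapping neg_log_likelihood in a grid over C and gamma would give the asker's full cross-validated search; lower mean scores indicate better-calibrated probabilities.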

With the new scorer interface in the development version of sklearn, you do not need subclassing. You only need to define a scoring object as described in the docs. Basically you need to do log_loss_score = Scorer(neg_log_loss, needs_threshold=True). This may fall back to decision_function, though.

You could also define a new scorer class that calls predict_proba on the estimator, to ensure that it gets normalized probabilities.
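In current scikit-learn releases this scorer already exists: passing scoring="neg_log_loss" to GridSearchCV makes the search call predict_proba and maximize the negative log loss, i.e. minimize the negative log likelihood, with no subclassing. A sketch, again on an assumed toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy imbalanced (~5:1) dataset for illustration.
X, y = make_classification(n_samples=200, weights=[5 / 6, 1 / 6], random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
search = GridSearchCV(
    SVC(probability=True, random_state=0),  # probability=True enables predict_proba
    param_grid,
    scoring="neg_log_loss",  # built-in scorer: uses predict_proba, higher is better
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because scorers are conventionally maximized, the reported scores are negated log losses (non-positive numbers); the best parameters are those whose score is closest to zero.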

Also, a pull request for log-loss would be welcome :)
