Looking for ideas to lower the false positive rate in Machine Learning Classification

Is there a way to reduce the false positive rate in a classic fraud prediction problem? I am currently working on classic fraud detection. There are 50,000 samples with true labels (the labels come from investigations), and those training labels are fairly balanced. The logistic regression model that I chose performs well, with an F1 score over 90 percent. However, when I use the model to predict new cases, the results are split 50/50 between fraud and non-fraud. Is there a way to tune the model so that it lets non-fraud cases pass through and penalizes false positives, so that we flag fewer cases as fraud (probably fewer than 200 out of one million), but the flagged cases are highly likely to be fraud? Hope that is clear.

Here are all the parameters that the logistic regression model takes.

sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

The defaults work well in most cases, so if you have changed any parameters, try the default values first. If you are already using the defaults and still getting poor results, you may want to change the parameter values to suit your dataset. For that you need to know what each of those parameters means. If you don't know, follow this link.
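As one illustration of adjusting a parameter to the problem, here is a minimal sketch that uses class_weight to make errors on the non-fraud class more costly during training. The synthetic dataset, the 10:1 weighting and the helper name false_positives are assumptions made for the example, not something from the original question, and the right weights would have to be found by validation on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced data standing in for the labelled cases (assumption).
# Class 1 is assumed to mean 'Fraud', class 0 'Not Fraud'.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def false_positives(class_weight):
    """Fit a model with the given class_weight and count test-set false positives."""
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(
        y_test, model.predict(X_test), labels=[0, 1]
    ).ravel()
    return int(fp)


fp_default = false_positives(None)            # default: both classes weighted equally
fp_weighted = false_positives({0: 10, 1: 1})  # illustrative: non-fraud errors cost 10x

print("false positives, default weights:", fp_default)
print("false positives, weighted 10:1:  ", fp_weighted)
```

Weighting the non-fraud class more heavily usually trades recall for fewer false positives, so some real fraud cases will be missed in exchange.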

So you want to bias the model towards predicting 'Not Fraud' more often. How to do that depends on the model you are using. With logistic regression you are free to set a threshold on the model's output, so that only the instances whose output is close to 1 are classified as 'Fraud'. In sklearn, for example, you can access the output probabilities of your model with predict_proba(X) (probabilities) or predict_log_proba(X) (log probabilities). (source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression )

If your model is supposed to output 1 for 'Fraud', you can threshold the output with a simple condition (if output > 0.8 then 'Fraud').
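A minimal sketch of this thresholding, assuming scikit-learn and a synthetic dataset in place of the real fraud data. The dataset, the variable names and the 0.8 cutoff are illustrative only; in practice the cutoff would be chosen on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced data standing in for the labelled cases (assumption).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is P(class 1); class 1 is assumed to mean 'Fraud'.
fraud_proba = model.predict_proba(X_test)[:, 1]

default_pred = model.predict(X_test)           # behaves like a 0.5 cutoff
strict_pred = (fraud_proba > 0.8).astype(int)  # flag only high-confidence cases

print("flagged at 0.5:", int(default_pred.sum()),
      "precision:", precision_score(y_test, default_pred))
print("flagged at 0.8:", int(strict_pred.sum()),
      "precision:", precision_score(y_test, strict_pred))
```

Raising the cutoff can only reduce the number of cases flagged as 'Fraud'. Whether precision improves enough for the target volume is something to check on data that reflects the real fraud rate, which is far lower than in a balanced training set.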
