
How to increase the model accuracy of logistic regression in Scikit python?

I am trying to predict the admit variable with predictors such as gre, gpa and rank, but the prediction accuracy is very low (0.66). The dataset is given below: https://gist.github.com/abyalias/3de80ab7fb93dcecc565cee21bd9501a

Please find the code below:

 In[73]: data.head(20)
 Out[73]: 

   admit  gre   gpa  rank_2  rank_3  rank_4
0      0  380  3.61     0.0     1.0     0.0
1      1  660  3.67     0.0     1.0     0.0
2      1  800  4.00     0.0     0.0     0.0
3      1  640  3.19     0.0     0.0     1.0
4      0  520  2.93     0.0     0.0     1.0
5      1  760  3.00     1.0     0.0     0.0
6      1  560  2.98     0.0     0.0     0.0

import numpy as np
from sklearn.linear_model import LogisticRegression
# train_test_split moved to sklearn.model_selection in scikit-learn >= 0.18;
# the old sklearn.cross_validation module has since been removed
from sklearn.model_selection import train_test_split

# 'data' is the DataFrame loaded from the gist linked above
y = data['admit']
x = data[data.columns[1:]]

xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=2)

ytrain = np.ravel(ytrain)

# modelling
clf = LogisticRegression(penalty='l2')
clf.fit(xtrain, ytrain)
ypred_train = clf.predict(xtrain)
ypred_test = clf.predict(xtest)

In[38]: # checking the classification accuracy
from sklearn.metrics import accuracy_score
accuracy_score(ytrain, ypred_train)
Out[38]: 0.70333333333333337
In[39]: accuracy_score(ytest,ypred_test)
Out[39]: 0.66000000000000003

In[78]: # confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(ytest, ypred_test)

Out[78]: 
array([[62,  1],
       [33,  4]])

The ones (class 1) are mostly being predicted wrongly. How can I increase the model accuracy?

Since machine learning is more about experimenting with the features and the models, there is no correct answer to your question. Some of my suggestions to you would be:

1. Feature Scaling and/or Normalization - Check the scales of your gre and gpa features. They differ by two orders of magnitude, so your gre feature will end up dominating the others in a classifier like Logistic Regression. You can normalize all your features to the same scale before putting them into a machine learning model; scikit-learn provides several feature scaling and normalization classes for this (see the sketch below).
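A minimal sketch of what that could look like with scikit-learn's StandardScaler, reusing xtrain/xtest/ytrain from the question's code. Note that the scaler is fit on the training split only, so no test-set statistics leak into training:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain)  # learn mean/std on training data only
xtest_scaled = scaler.transform(xtest)        # reuse the training mean/std

clf = LogisticRegression(penalty='l2')
clf.fit(xtrain_scaled, ytrain)

After scaling, gre (hundreds) and gpa (single digits) contribute on comparable scales.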

2. Class Imbalance - Look for class imbalance in your data. Since you are working with admit/reject data, the number of rejects is likely to be significantly higher than the number of admits. Most classifiers in scikit-learn, including LogisticRegression, have a class_weight parameter; setting it to 'balanced' might also work well in case of a class imbalance (see the sketch below).
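A minimal sketch, again reusing xtrain/ytrain from the question:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class inversely proportional to its frequency,
# so the minority class is not drowned out during training
clf = LogisticRegression(penalty='l2', class_weight='balanced')
clf.fit(xtrain, ytrain)

This matches your confusion matrix above, where most true 1s are being predicted as 0.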

3. Optimize other scores - You can also optimize other metrics, such as Log Loss and F1-Score. The F1-Score could be especially useful in case of class imbalance. The scikit-learn documentation on model evaluation is a good guide to the available scoring options; a short sketch follows.
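A minimal sketch, reusing ytest, ypred_test, clf and xtest from the earlier code:

from sklearn.metrics import f1_score, log_loss

print(f1_score(ytest, ypred_test))                # harmonic mean of precision and recall
print(log_loss(ytest, clf.predict_proba(xtest)))  # penalizes confident wrong probabilities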

4. Hyperparameter Tuning - Grid Search - You can improve your accuracy by performing a grid search to tune the hyperparameters of your model. For example, in the case of LogisticRegression, the parameter C is a hyperparameter. You should avoid using the test data during the grid search; instead, perform cross validation on the training data and use the test data only to report the final numbers for your final model. Note that a grid search should be done for every model you try, because only then can you tell the best you can get from each model. Scikit-learn provides the GridSearchCV class for this (see the sketch below).
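A minimal sketch, tuning C with 5-fold cross validation on the training split only; the grid values here are just illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}  # inverse regularization strength
grid = GridSearchCV(LogisticRegression(penalty='l2'), param_grid, cv=5, scoring='f1')
grid.fit(xtrain, ytrain)                     # cross validation stays inside the training split
print(grid.best_params_, grid.best_score_)
ypred_test = grid.best_estimator_.predict(xtest)  # touch the test set only for final numbers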

5. Explore more classifiers - Logistic Regression learns a linear decision surface that separates your classes, and it is possible that your two classes are not linearly separable. In that case you might need to look at other classifiers, such as Support Vector Machines, which are able to learn more complex decision boundaries. You can also start looking at tree-based classifiers such as Decision Trees, which learn rules from your data; think of them as a series of if-else rules that the algorithm automatically learns from the data. It is often difficult to get the right Bias-Variance Tradeoff with Decision Trees, so if you have a considerable amount of data I would recommend looking at Random Forests. A quick sketch of both alternatives follows.
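A minimal sketch reusing xtrain/ytrain/xtest/ytest, with default hyperparameters rather than tuned values:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# SVMs are also sensitive to feature scales, so pair this with the scaling step above
svm = SVC(kernel='rbf').fit(xtrain, ytrain)
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(xtrain, ytrain)
print(accuracy_score(ytest, svm.predict(xtest)))
print(accuracy_score(ytest, rf.predict(xtest)))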

6. Error Analysis - For each of your models, go back and look at the cases where they fail. You might find that some of your models work well on one part of the parameter space while others work better on other parts. If that is the case, ensemble techniques such as the VotingClassifier often give the best results; models that win Kaggle competitions are many times ensemble models (see the sketch below).
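A minimal sketch of a hard-voting ensemble over the three model families discussed above; each member gets one vote and the majority wins:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voter = VotingClassifier(estimators=[
    ('lr', LogisticRegression(penalty='l2')),
    ('svm', SVC(kernel='rbf')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2)),
], voting='hard')
voter.fit(xtrain, ytrain)  # fits a clone of each member on the training data
print(voter.score(xtest, ytest))

Hard voting needs no predicted probabilities; to use voting='soft' you would construct SVC with probability=True.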

7. More Features - If all of this fails, it means you should start looking for more features.

Hope that helps!
