Logistic regression in Python: train set and test set
file = pd.DataFrame({'name':['s', 'k', 'lo', 'ki'] , 'age':[12, 23, 32, 22], 'marks':[34, 34, 43, 22], 'score':[1, 1, 0, 1]})
I would like to run a logistic regression with the following commands:
import statsmodels.formula.api as smf
logit = smf.logit( 'score ~ age + marks', file)
results = logit.fit()
But I get an error:
"statsmodels.tools.sm_exceptions.PerfectSeparationError:
Perfect separation detected, results not available".
I would also like to split the data into a train set and a test set. How can I do that? I have to use the predict command afterwards.
The "glm" command in R looks much easier than Python.
I came across a similar error when I was working with some data. It is caused by a property of the data: the two groups (score=0 and score=1) are perfectly separated, so the decision boundary can lie anywhere between them and there are infinitely many solutions. The library is therefore not able to return a single solution. This FIGURE shows your data. Solutions 1, 2, and 3 are all valid.
I ran this using glmnet in Matlab. The warning from Matlab reads:
Warning: The estimated coefficients perfectly separate failures from successes. This means the theoretical best estimates are not finite.
Using more data points will help.
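To see why the best estimates are not finite, here is a minimal sketch using the four rows from the question. The separating line (score=1 whenever marks < 38.5) is hand-picked for illustration and is not something the library produces. Multiplying its coefficients by a larger and larger factor keeps increasing the log-likelihood toward 0, so no finite maximum exists.

```python
import numpy as np

# Data from the question: columns are (age, marks), labels are score.
X = np.array([[12, 34], [23, 34], [32, 43], [22, 22]], dtype=float)
y = np.array([1, 1, 0, 1])

# Hand-picked separating line (an assumption for illustration):
# predict score=1 whenever marks < 38.5, ignoring age.
w = np.array([0.0, -1.0])
b = 38.5


def log_likelihood(scale):
    """Logistic log-likelihood when the coefficients (w, b) are multiplied by `scale`."""
    z = scale * (X @ w + b)            # linear predictor for each row
    signed = np.where(y == 1, z, -z)   # positive when the row is on the correct side
    # log(sigmoid(t)) computed stably as -log(1 + exp(-t))
    return -np.logaddexp(0.0, -signed).sum()


scales = [0.1, 1.0, 5.0, 10.0]
lls = [log_likelihood(s) for s in scales]
for s, ll in zip(scales, lls):
    print(f"scale={s:>4}: log-likelihood={ll:.3e}")
```

Every row ends up on the correct side of the line, so each larger scale fits the data slightly better than the previous one. That is the situation the PerfectSeparationError is reporting.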
Interestingly, LogisticRegression from scikit-learn works without complaints. This is most likely because it applies L2 regularization by default, which keeps the coefficients finite even when the classes are perfectly separated.
Example code using scikit-learn for your problem:
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
file = pd.DataFrame({'name':['s', 'k', 'lo', 'ki'] , 'age':[12, 23, 32, 22], 'marks':[34, 34, 43, 22], 'score':[1, 1, 0, 1]})
# Prepare the data
y,X = dmatrices('score ~ age + marks',file)
y = np.ravel(y)
# Fit the data to Logistic Regression model
model = LogisticRegression()
model = model.fit(X,y)
For splitting data into training and test sets, see train_test_split. In current scikit-learn releases it lives in sklearn.model_selection rather than sklearn.cross_validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
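As a rough sketch of the split-then-predict workflow the question asks about: four rows are too few to split meaningfully, so the example below builds a hypothetical 40-row DataFrame with the same column names. The synthetic values, the 25% test size, and the max_iter setting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data with the same columns as the question (not real results).
marks = np.arange(20, 60)
score = (marks < 40).astype(int)
flip = [5, 15, 25, 35]
score[flip] = 1 - score[flip]  # flip a few labels so the classes overlap
data = pd.DataFrame({
    'age': np.tile([12, 23, 32, 22], 10),
    'marks': marks,
    'score': score,
})

X = data[['age', 'marks']]
y = data['score']

# Hold out 25% of the rows for testing; stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training rows only, then predict the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)

print(predictions)
print(accuracy)
```

model.predict returns class labels. If probabilities are needed instead, model.predict_proba(X_test) returns one column per class.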