Python: logistic regression with training and test sets

Question

import pandas as pd
file = pd.DataFrame({'name':['s', 'k', 'lo', 'ki'] , 'age':[12, 23, 32, 22], 'marks':[34, 34, 43, 22], 'score':[1, 1, 0, 1]})

I would like to run a logistic regression with the following command:

import statsmodels.formula.api as smf 
logit = smf.logit( 'score ~ age + marks', file)
results = logit.fit()

But I get an error:

"statsmodels.tools.sm_exceptions.PerfectSeparationError:
Perfect separation detected, results not available".

I would also like to split the data into a training set and a test set. How can I do that? I have to use the predict command afterwards.

The "glm" command in R looks much easier than Python.

Answer 1

I came across a similar error when I was working with some data. It is caused by a property of the data: the two groups (score=0 and score=1) are perfectly separated, so the decision boundary can be placed anywhere between them (infinitely many solutions). The likelihood keeps improving as the coefficients grow, so the library is not able to return a single solution. This figure shows your data. Solutions 1, 2, and 3 are all valid.
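To make the perfect-separation argument concrete, here is a minimal sketch using only the four rows from the question. The boundary age = 27.5 is a hypothetical choice of mine (any line that separates the single score=0 row from the rest would do), and `log_likelihood` is an illustrative helper, not part of statsmodels. Scaling the coefficients up keeps raising the log-likelihood toward 0, so there is no finite maximum.

```python
import math

# Data from the question: (age, marks, score).
rows = [(12, 34, 1), (23, 34, 1), (32, 43, 0), (22, 22, 1)]


def log_likelihood(scale):
    """Log-likelihood of the model logit(p) = scale * (27.5 - age).

    The boundary age = 27.5 is an assumed separating line: every row with
    score=1 has age < 27.5 and the only row with score=0 has age > 27.5.
    """
    total = 0.0
    for age, _marks, score in rows:
        z = scale * (27.5 - age)
        # Flip the sign for score=0 so that "signed" is positive whenever
        # the row is on the correct side of the boundary.
        signed = z if score == 1 else -z
        # log(sigmoid(signed)) written as -log(1 + exp(-signed))
        total += -math.log1p(math.exp(-signed))
    return total


scales = [0.1, 0.5, 1.0, 2.0]
log_likelihoods = [log_likelihood(s) for s in scales]
for s, ll in zip(scales, log_likelihoods):
    print(f"scale={s:<4} log-likelihood={ll:.6f}")
```

Each larger scale gives a strictly better fit, which is why an unregularized optimizer has nothing finite to converge to.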

I ran this using glmnet in Matlab. The message from Matlab reads:

Warning: The estimated coefficients perfectly separate failures from successes. This means the theoretical best estimates are not finite.

Using more data points will help.

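Interestingly, LogisticRegression from scikit-learn seems to work without complaints. This is most likely because it applies L2 regularization by default, which keeps the coefficients finite even when the classes are perfectly separated.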
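One way to see why scikit-learn does not complain is to vary its regularization strength. The sketch below assumes the default L2 penalty and refits the question's four rows with several values of `C` (the inverse regularization strength); the particular `C` values and the `coef_norms` dictionary are illustrative choices of mine. Weaker regularization (larger `C`) should let the coefficients grow, which is the same divergence that statsmodels reports as an error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four rows from the question: columns are age and marks.
X = np.array([[12, 34], [23, 34], [32, 43], [22, 22]], dtype=float)
y = np.array([1, 1, 0, 1])

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Larger C means weaker L2 regularization.
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient norm={coef_norms[C]:.4f}")
```

The fitted coefficients therefore depend on the chosen `C` rather than on the data alone, which is worth keeping in mind when comparing against R's glm.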

Example code using scikit-learn for your problem:

import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

file = pd.DataFrame({'name':['s', 'k', 'lo', 'ki'] , 'age':[12, 23, 32, 22], 'marks':[34, 34, 43, 22], 'score':[1, 1, 0, 1]})
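# Build the response y and the design matrix X from the formula
# (patsy adds an Intercept column to X)
y, X = dmatrices('score ~ age + marks', file)
y = np.ravel(y)  # flatten the (n, 1) column into a 1-D array
# Fit the logistic regression model
model = LogisticRegression()
model = model.fit(X, y)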

For splitting data into training and testing sets, you may want to refer to this: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html (in current scikit-learn releases, train_test_split lives in sklearn.model_selection).
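A minimal sketch of the split, fit, and predict workflow follows. Four rows are too few to split meaningfully (the training set could end up with a single class), so this example uses hypothetical synthetic data with the same column names; the data-generating rule, the 75/25 split, and `max_iter=1000` are all assumptions made for illustration. It imports `train_test_split` from `sklearn.model_selection`, which is where current scikit-learn releases provide it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 rows with the same columns as the question.
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "age": rng.integers(10, 40, size=n),
    "marks": rng.integers(20, 50, size=n),
})
# Noisy labels so that the two classes overlap (no perfect separation).
noise = rng.normal(0, 5, size=n)
data["score"] = (data["age"] + noise < 27).astype(int)

X = data[["age", "marks"]]
y = data["score"]

# Hold out 25% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict class labels and class-1 probabilities for the held-out rows.
predicted_labels = model.predict(X_test)
predicted_probs = model.predict_proba(X_test)[:, 1]
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

With real data, replace the synthetic DataFrame with your own and keep the rest unchanged; `predict` plays the role of the predict command mentioned in the question.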

Answer 1, 2015-03-10 04:49:07
