Logistic regression in Python: test set and train set

import pandas as pd
file = pd.DataFrame({'name': ['s', 'k', 'lo', 'ki'], 'age': [12, 23, 32, 22], 'marks': [34, 34, 43, 22], 'score': [1, 1, 0, 1]})

I would like to run a logistic regression with the following command:

import statsmodels.formula.api as smf 
logit = smf.logit( 'score ~ age + marks', file)
results = logit.fit() 

But I get an error:

"statsmodels.tools.sm_exceptions.PerfectSeparationError:
Perfect separation detected, results not available". 

I would also like to split the data into a train set and a test set. How can I do that? I have to use the predict command after this.

The "glm" command in R looks much easier than Python.

I came across a similar error when I was working with some data. It is caused by a property of the data itself. Since the two groups (score=0 and score=1) are perfectly separated in your data, the decision boundary can be anywhere between them (infinitely many solutions), and the likelihood keeps improving as the coefficients grow. So the library is not able to return a single finite solution. This FIGURE shows your data. Solutions 1, 2, and 3 are all valid.
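
The "no finite solution" point can be checked numerically. The following is only a minimal sketch: it assumes the separating rule "score = 1 when age < 27.5" (27.5 is an arbitrary threshold between the ages 23 and 32), and the scale values k are chosen purely for illustration.

```python
import numpy as np

# Data from the question
age = np.array([12, 23, 32, 22], dtype=float)
score = np.array([1, 1, 0, 1])

def log_likelihood(k, threshold=27.5):
    """Logistic log-likelihood for the linear predictor z = k * (threshold - age)."""
    z = k * (threshold - age)
    signs = 2 * score - 1  # +1 for score=1, -1 for score=0
    # log sigmoid(s*z) = -log(1 + exp(-s*z)), computed stably with logaddexp
    return -np.logaddexp(0.0, -signs * z).sum()

scales = [0.1, 1.0, 10.0, 100.0]  # illustrative coefficient scales
log_likelihoods = [log_likelihood(k) for k in scales]
for k, ll in zip(scales, log_likelihoods):
    print(f"k = {k:>5}: log-likelihood = {ll:.6g}")
```

Each larger k gives a log-likelihood closer to 0 without ever reaching it, so there is no finite set of coefficients that maximizes it.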

I ran this using glmnet in Matlab. The message from Matlab reads:

Warning: The estimated coefficients perfectly separate failures from successes. This means the theoretical best estimates are not finite.

Using more data points will help.

Interestingly, LogisticRegression from scikit-learn seems to work without complaints.

Example code using scikit-learn for your problem:

import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

file = pd.DataFrame({'name': ['s', 'k', 'lo', 'ki'], 'age': [12, 23, 32, 22], 'marks': [34, 34, 43, 22], 'score': [1, 1, 0, 1]})
# Prepare the data: y is the response, X the design matrix (patsy adds an Intercept column)
y, X = dmatrices('score ~ age + marks', file)
y = np.ravel(y)  # flatten the (n, 1) response into the 1-D array scikit-learn expects
# Fit the data to a logistic regression model
model = LogisticRegression()
model = model.fit(X, y)
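
One likely reason this fit goes through is that LogisticRegression is L2-regularized by default (C=1.0), and the penalty keeps the coefficients finite even on separated data. A minimal sketch of that effect, assuming the same four rows and two arbitrarily chosen values of C (a larger C means a weaker penalty):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same rows as in the question (the unused 'name' column is dropped)
file = pd.DataFrame({'age': [12, 23, 32, 22],
                     'marks': [34, 34, 43, 22],
                     'score': [1, 1, 0, 1]})
X = file[['age', 'marks']]
y = file['score']

coef_norms = {}
for C in (1.0, 1e4):  # illustrative values; larger C = weaker L2 penalty
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:g}: coefficient norm = {coef_norms[C]:.3f}")
```

The coefficients should come out larger for the weaker penalty, which is consistent with the unpenalized estimates running off to infinity.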

For splitting data into training and testing sets, you may want to refer to this: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html (in current scikit-learn releases, train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation).
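
A minimal sketch of the split, fit, and predict steps asked about in the question, assuming a recent scikit-learn where train_test_split is imported from sklearn.model_selection. The four rows from the question are too few to split meaningfully, so the data frame below is synthetic: the sample size, the rule generating score, the noise level, and the name `data` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not from the question)
rng = np.random.default_rng(0)
n = 200
age = rng.integers(10, 40, size=n)
marks = rng.integers(20, 50, size=n)
noise = rng.normal(0.0, 3.0, size=n)  # noise makes the two classes overlap
score = ((25 - age) + noise > 0).astype(int)
data = pd.DataFrame({'age': age, 'marks': marks, 'score': score})

X = data[['age', 'marks']]
y = data['score']

# Hold out 25% of the rows; stratify keeps the class ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

predictions = model.predict(X_test)          # predicted class labels (0 or 1)
probabilities = model.predict_proba(X_test)  # one column per class
accuracy = model.score(X_test, y_test)       # fraction of correct predictions
print(f"test accuracy: {accuracy:.2f}")
```

The model only ever sees X_train and y_train during fitting, so the accuracy on X_test is an estimate of how it would do on new rows.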
