Python pandas linear regression groupby
I am trying to run a linear regression on each group of a pandas DataFrame.
This is the DataFrame df:
group date value
A 01-02-2016 16
A 01-03-2016 15
A 01-04-2016 14
A 01-05-2016 17
A 01-06-2016 19
A 01-07-2016 20
B 01-02-2016 16
B 01-03-2016 13
B 01-04-2016 13
C 01-02-2016 16
C 01-03-2016 16
#import standard packages
import pandas as pd
import numpy as np
#import ML packages
from sklearn.linear_model import LinearRegression
#First, let's group the data by group
df_group = df.groupby('group')
#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1,'D')
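As a sanity check, the two conversion lines above turn the dates into float day offsets from the earliest date. A minimal sketch on a two-row frame with the same layout as the question's df:

```python
import numpy as np
import pandas as pd

# Tiny frame shaped like the question's df (MM-DD-YYYY date strings)
df = pd.DataFrame({'group': ['A', 'A'],
                   'date': ['01-02-2016', '01-05-2016'],
                   'value': [16, 17]})

df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
print(df['date_delta'].tolist())  # [0.0, 3.0]
```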
Now I want to predict the value for each group for 01-10-2016.
I want to get to a new dataframe like this:
group 01-10-2016
A predicted value
B predicted value
C predicted value
This related question, How to apply OLS from statsmodels to groupby, doesn't work:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df['date_delta']
    y = df['value']
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
I get the following error:
ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
UPDATE:
I changed it to:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df[['date_delta']]
    y = df.value
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
and now I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
def model(df, delta):
    y = df[['value']].values
    X = df[['date_delta']].values
    return np.squeeze(LinearRegression().fit(X, y).predict(delta))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)
    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)
    dd = (date - mn) / day
    return df.groupby('group').apply(model, delta=dd)
demo
group_predictions(df, '01-10-2016')
group
A 22.333333333333332
B 3.500000000000007
C 16.0
dtype: object
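For completeness, here is the same approach as a self-contained script built from the question's data (one change from the code above: the scalar delta is wrapped as a 2D array before `predict`, which newer sklearn versions require):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def model(df, delta):
    y = df[['value']].values
    X = df[['date_delta']].values
    # predict() expects a 2D array, hence the [[delta]] wrapping of the scalar
    return np.squeeze(LinearRegression().fit(X, y).predict(np.array([[delta]])))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)
    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)
    dd = (date - mn) / day  # day offset of the target date
    return df.groupby('group').apply(model, delta=dd)

df = pd.DataFrame({
    'group': list('AAAAAABBBCC'),
    'date': ['01-02-2016', '01-03-2016', '01-04-2016', '01-05-2016',
             '01-06-2016', '01-07-2016', '01-02-2016', '01-03-2016',
             '01-04-2016', '01-02-2016', '01-03-2016'],
    'value': [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

print(group_predictions(df, '01-10-2016'))
# A ≈ 22.33, B ≈ 3.5, C = 16.0, matching the output above
```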
You're using LinearRegression wrong. You instantiate it with
model = LinearRegression()
then fit with
model.fit(X, y)
But all that does is set values on the object stored in model. There is no nice summary method. There probably is one somewhere, but I know the one in statsmodels, soooo, see below.
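To make the sklearn pattern concrete: after fit, the fitted parameters live on the estimator as attributes rather than in a summary object. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # one feature, 2D as sklearn requires
y = np.array([1.0, 3.0, 5.0, 7.0])          # exactly y = 2x + 1

model = LinearRegression()
model.fit(X, y)  # fit() mutates the estimator in place and returns it

print(model.coef_, model.intercept_)  # [2.] 1.0 -- there is no .summary()
```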
option 1
use statsmodels instead
from statsmodels.formula.api import ols
for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: 0.652
Model: OLS Adj. R-squared: 0.565
Method: Least Squares F-statistic: 7.500
Date: Fri, 06 Jan 2017 Prob (F-statistic): 0.0520
Time: 10:48:17 Log-Likelihood: -9.8391
No. Observations: 6 AIC: 23.68
Df Residuals: 4 BIC: 23.26
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 14.3333 1.106 12.965 0.000 11.264 17.403
date_delta 1.0000 0.365 2.739 0.052 -0.014 2.014
==============================================================================
Omnibus: nan Durbin-Watson: 1.393
Prob(Omnibus): nan Jarque-Bera (JB): 0.461
Skew: -0.649 Prob(JB): 0.794
Kurtosis: 2.602 Cond. No. 5.78
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: 0.750
Model: OLS Adj. R-squared: 0.500
Method: Least Squares F-statistic: 3.000
Date: Fri, 06 Jan 2017 Prob (F-statistic): 0.333
Time: 10:48:17 Log-Likelihood: -3.2171
No. Observations: 3 AIC: 10.43
Df Residuals: 1 BIC: 8.631
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 15.5000 1.118 13.864 0.046 1.294 29.706
date_delta -1.5000 0.866 -1.732 0.333 -12.504 9.504
==============================================================================
Omnibus: nan Durbin-Watson: 3.000
Prob(Omnibus): nan Jarque-Bera (JB): 0.531
Skew: -0.707 Prob(JB): 0.767
Kurtosis: 1.500 Cond. No. 2.92
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: -inf
Model: OLS Adj. R-squared: -inf
Method: Least Squares F-statistic: -0.000
Date: Fri, 06 Jan 2017 Prob (F-statistic): nan
Time: 10:48:17 Log-Likelihood: 63.481
No. Observations: 2 AIC: -123.0
Df Residuals: 0 BIC: -125.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 16.0000 inf 0 nan nan nan
date_delta -3.553e-15 inf -0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 0.400
Prob(Omnibus): nan Jarque-Bera (JB): 0.333
Skew: 0.000 Prob(JB): 0.846
Kurtosis: 1.000 Cond. No. 2.62
==============================================================================
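If you also want the 01-10-2016 prediction rather than just the summary, the fitted formula-API result can predict from new date_delta values. A sketch on group A (here 8.0 is the day offset of 01-10-2016 from the earliest date, 01-02-2016):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Group A from the question, with date_delta already computed (0..5)
g = pd.DataFrame({'date_delta': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                  'value': [16, 15, 14, 17, 19, 20]})

results = ols('value ~ date_delta', g).fit()
pred = results.predict(pd.DataFrame({'date_delta': [8.0]}))
print(pred.iloc[0])  # ≈ 22.33, matching the sklearn route
```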
As a newbie I cannot comment so I will write it as a new answer. To solve the error:
Runtime Error: ValueError: Expected 2D array, got scalar array instead
you need to reshape the delta value in the line:
return np.squeeze(LinearRegression().fit(X, y).predict(np.array(delta).reshape(1, -1)))
Credit stays with piRSquared.
This might be a late response but I post the answer anyway should someone encounter the same problem. Actually, everything that was shown was correct except for the regression block. Here are the two problems with the implementation:
Please note that model.fit(X, y) expects X to be {array-like, sparse matrix} of shape (n_samples, n_features), so X must be 2D (y may stay 1D or be reshaped to a column as well). You can easily convert a 1D series to 2D with the reshape(-1, 1) command.
The second problem is the regression fitting process itself: y and X are not the input of model = LinearRegression(y, X) but rather the input of model.fit(X, y).
Here is the modification to the regression block:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = np.array(df['date_delta']).reshape(-1, 1)  # a Series has no reshape, so convert to an array first
    y = np.array(df.value).reshape(-1, 1)
    model = LinearRegression()  # <--- the constructor does not accept (X, y)
    results = model.fit(X, y)
    print(results.coef_, results.intercept_)  # sklearn's LinearRegression has no .summary()
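The reshape(-1, 1) conversion mentioned above, sketched on a plain Series in isolation:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0])   # 1D: shape (3,)
X = s.values.reshape(-1, 1)      # 2D column vector: shape (3, 1)
print(X.shape)  # (3, 1)
```

Note that selecting with a list of column names, as in df[['date_delta']], already yields a 2D DataFrame, so either route gives sklearn the shape it expects.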