R's relevel() and factor variables in linear regression in pandas
Data:
a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red
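In R, if I want to build a linear regression model that takes categorical data into account (I believe these are called factor variables in R), I can simply do: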
df$d = relevel(df$d, 'green')
After this, when building the model, R adds an indicator column for each color, for example:
dblue
0
1
0
0
0
1
0
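There will be no green column, because if all the other color columns are 0, the row must be green (this is our reference level). Now, create a regression model: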
mod = lm(a ~ b + c + d, data=df)
summary(mod)
Call:
lm(formula = a ~ b + c + d, data = rel)
Residuals:
         1          2          3          4          5          6          7
 4.708e-16 -7.061e-16  2.219e-31  2.354e-16 -1.233e-31  7.061e-16 -7.061e-16
Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -1.600e+00  3.622e-15 -4.418e+14 1.44e-15 ***
b            1.600e+00  9.403e-16  1.702e+15 3.74e-16 ***
c           -6.000e-01  3.766e-16 -1.593e+15 4.00e-16 ***
dblue        8.829e-16  1.823e-15  4.840e-01    0.713
dorange      1.589e-15  2.294e-15  6.930e-01    0.614
dred         2.295e-15  1.631e-15  1.407e+00    0.393
I am trying to achieve the same thing in Python with pandas. So far, I have only come up with this:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)
# add one indicator column per color, skipping the reference level 'green'
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
# axis must be passed by keyword in recent pandas versions
df = df.drop('d', axis=1)
It works and produces the same result, but I was wondering whether pandas has a built-in method for this.
You can use pd.get_dummies:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
# dtype=int keeps the 0/1 columns shown below; pandas >= 2.0 returns bool by default
dummies = pd.get_dummies(df['d'], dtype=int)
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)
yields
   a  b   c  blue  orange  red
0  1  5   9     0       0    1
1  2  6  10     1       0    0
2  3  7  11     0       0    0
3  4  8  12     0       0    1
4  3  4   3     0       1    0
5  3  4   3     1       0    0
6  3  4   3     0       0    1
Using statsmodels,
import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())
yields
                            OLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.149e+25
Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.64e-13
Time:                        05:57:33   Log-Likelihood:                 200.74
No. Observations:                   7   AIC:                            -389.5
Df Residuals:                       1   BIC:                            -389.8
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.6000   6.11e-13  -2.62e+12      0.000        -1.600    -1.600
b              1.6000   1.59e-13   1.01e+13      0.000         1.600     1.600
c             -0.6000   6.36e-14  -9.44e+12      0.000        -0.600    -0.600
blue         1.11e-16   3.08e-13      0.000      1.000     -3.91e-12  3.91e-12
orange      7.994e-15   3.87e-13      0.021      0.987     -4.91e-12  4.93e-12
red         4.829e-15   2.75e-13      0.018      0.989     -3.49e-12   3.5e-12
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.203
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.752
Skew:                           0.200   Prob(JB):                        0.687
Kurtosis:                       1.445   Cond. No.                         85.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
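As a quick sanity check on the coefficients above, the same fit can be reproduced without statsmodels by solving the least-squares problem directly with NumPy. This is only a sketch: the names X, beta and coefs are my own, and dtype=float is passed to get_dummies so that the design matrix is numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# Dummy-encode d and drop the reference level 'green'
dummies = pd.get_dummies(df['d'], dtype=float).drop(columns='green')

# Design matrix: intercept, b, c, then one column per non-reference color
X = pd.concat([df[['b', 'c']].astype(float), dummies], axis=1)
X.insert(0, 'Intercept', 1.0)

# Ordinary least squares solved directly with NumPy
y = df['a'].to_numpy(dtype=float)
beta, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
coefs = pd.Series(beta, index=X.columns)
print(coefs.round(6))
```

The intercept, b and c estimates should agree with the summary above, and the color coefficients should be zero up to floating-point noise, since a is an exact linear function of b and c in this data.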
Alternatively, you can use a patsy formula to specify the treatment (dummy) contrast and its reference level:
import pandas as pd
import statsmodels.formula.api as smf
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)
model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())
It can also be simplified this way:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
# note: drop_first=True drops the first category, which is 'blue' here, not 'green'
df = pd.get_dummies(df, prefix='color', drop_first=True)
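Because drop_first=True removes whichever category comes first, one way to get green as the reference level (as relevel() does in R) is to move it to the front of the category order before encoding. The helper name relevel below is my own; pandas has no function by that name, so treat this as a sketch rather than a built-in.

```python
import pandas as pd

def relevel(series, ref):
    """Return a categorical copy of `series` with `ref` as the first category.

    Hypothetical helper that mimics R's relevel(); it is not part of pandas.
    """
    cat = series.astype('category')
    others = [c for c in cat.cat.categories if c != ref]
    # reorder_categories raises ValueError if `ref` is not an existing category
    return cat.cat.reorder_categories([ref] + others)

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})
df['d'] = relevel(df['d'], 'green')

# drop_first=True now drops 'green', because it is the first category
encoded = pd.get_dummies(df, prefix='color', drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The remaining dummy columns keep the category order, so they come out as color_blue, color_orange and color_red after the numeric columns.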