拟合模型时发生ValueError

Question

我正在运行此代码只是为了检查线性回归模型如何在python中工作：

import pandas as pd
import numpy as np
import statsmodels.api as sm

train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])

print train.head()

#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday

temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday

#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed', 'year',
            'month', 'weekday', 'hour']

#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))

#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]

# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()

ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)

plt.figure();
plt.plot(validation[features], ypred, 'o', validation[features], validation['log-count'], 'b-');
plt.title('blue: true,   red: OLS');

弹出以下错误信息。 这是什么意思，以及如何解决？

Traceback (most recent call last):
  File "C:/TestModel/linear_regression.py", line 99, in <module>
    ypred = f.predict(sm.add_constant(validation[features]))
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
    return self.model.predict(self.params, exog, *args, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
    return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)

这是数据样本：

print training.head()
             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147  


print validation.head()
                datetime  season  holiday  workingday  weather   temp   atemp  \
8708 2012-08-05 05:00:00       3        0           0        1  29.52  34.850   
8709 2012-08-05 06:00:00       3        0           0        1  29.52  34.850   
8710 2012-08-05 07:00:00       3        0           0        1  30.34  35.605   
8711 2012-08-05 08:00:00       3        0           0        1  31.16  36.365   
8712 2012-08-05 09:00:00       3        0           0        1  32.80  38.635   

      humidity  windspeed  casual  registered  count  year  month  hour  \
8708        74    16.9979       1          18     19  2012      8     5   
8709        79    16.9979       7          12     19  2012      8     6   
8710        74    19.9995      18          50     68  2012      8     7   
8711        66    22.0028      27          81    108  2012      8     8   
8712        59    23.9994      61         168    229  2012      8     9   

      weekday  log-casual  log-registered  log-count  
8708        6    0.693147        2.944439   2.995732  
8709        6    2.079442        2.564949   2.995732  
8710        6    2.944439        3.931826   4.234107  
8711        6    3.332205        4.406719   4.691348  
8712        6    4.127134        5.129899   5.438079

Answer 1

对于此用例，这看起来像是add_constant函数的设计问题。

从文档字符串：

“对于ndarrays和pandas.DataFrames，请检查以确保不包含常量。如果存在至少一列的常量，则返回原始对象。”

http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tools/tools.html#add_constant

我认为以这种方式定义此方法是为了避免使用奇异的设计矩阵进行估算，但是predict也将适用于奇异的矩阵。

我的猜测是，您的validation数据只有一列具有所有相同的值，例如它们都可能来自同一年。 如果这是故意的，则需要将常量手动添加到数据框。

如果add_constant有一个选项可以改变这种行为，那会更好。

拟合模型时发生ValueError

问题描述

1 个解决方案

解决方案1
2 已采纳 2015-09-15 15:13:11

拟合模型时发生ValueError

问题描述

1 个解决方案

解决方案1 2 已采纳 2015-09-15 15:13:11

解决方案1
2 已采纳 2015-09-15 15:13:11