ValueError when fitting a model

Question

I am running this code just to check how a linear regression model works in Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls at the end
import statsmodels.api as sm

train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])

print train.head()

#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday

temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday

#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed', 'year',
            'month', 'weekday', 'hour']

#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))

#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]

# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()

ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)

plt.figure()
# 'ro' draws the OLS predictions in red so that the colours match the title
plt.plot(validation[features], ypred, 'ro', validation[features], validation['log-count'], 'b-')
plt.title('blue: true,   red: OLS')

The following error message appears. What does it mean, and how can I fix it?

Traceback (most recent call last):
  File "C:/TestModel/linear_regression.py", line 99, in <module>
    ypred = f.predict(sm.add_constant(validation[features]))
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
    return self.model.predict(self.params, exog, *args, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
    return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)

Here is a sample of the data:

print training.head()
             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147  


print validation.head()
                datetime  season  holiday  workingday  weather   temp   atemp  \
8708 2012-08-05 05:00:00       3        0           0        1  29.52  34.850   
8709 2012-08-05 06:00:00       3        0           0        1  29.52  34.850   
8710 2012-08-05 07:00:00       3        0           0        1  30.34  35.605   
8711 2012-08-05 08:00:00       3        0           0        1  31.16  36.365   
8712 2012-08-05 09:00:00       3        0           0        1  32.80  38.635   

      humidity  windspeed  casual  registered  count  year  month  hour  \
8708        74    16.9979       1          18     19  2012      8     5   
8709        79    16.9979       7          12     19  2012      8     6   
8710        74    19.9995      18          50     68  2012      8     7   
8711        66    22.0028      27          81    108  2012      8     8   
8712        59    23.9994      61         168    229  2012      8     9   

      weekday  log-casual  log-registered  log-count  
8708        6    0.693147        2.944439   2.995732  
8709        6    2.079442        2.564949   2.995732  
8710        6    2.944439        3.931826   4.234107  
8711        6    3.332205        4.406719   4.691348  
8712        6    4.127134        5.129899   5.438079

Answer 1

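For this use case, this looks like a design problem in the add_constant function.

From the docstring:

"For ndarrays and pandas.DataFrames, checks to make sure a constant is not already included. If there is at least one column of constants, the original object is returned."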

http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tools/tools.html#add_constant

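I think the method was defined this way to avoid estimating with a singular design matrix, but predict works with singular matrices as well.

My guess is that your validation data has a column in which all values are identical, for example all rows might come from the same year. add_constant then treats that column as an existing constant and returns the data unchanged, which leaves 12 columns where the fitted model expects 13. If this is intentional, you need to add the constant to the DataFrame manually.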
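
A minimal sketch of adding the constant by hand. The helper name add_intercept and the toy DataFrame are made up for illustration; the column name 'const' and its leading position are chosen to match what add_constant produces for a DataFrame by default.

```python
import pandas as pd

def add_intercept(df, name='const'):
    """Return a copy of df with a leading column of ones, added unconditionally."""
    out = df.copy()
    out.insert(0, name, 1.0)
    return out

# Toy stand-in for validation[features]: 'year' is constant, as guessed above.
validation_X = pd.DataFrame({'year': [2012, 2012, 2012],
                             'hour': [5, 6, 7]},
                            columns=['year', 'hour'])

X_val = add_intercept(validation_X)
print(X_val.shape)
```

With the data from the question, the call would presumably look like f.predict(add_intercept(validation[features])), so that the design matrix has the same 13 columns as the one used for fitting.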

It would be better if add_constant had an option to change this behavior.
