Unable to make predictions with an OLS model

Question

I am building an OLS model but cannot get it to make any predictions.

Can you explain what I am doing wrong?

Building the model:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

This then raises: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What exactly am I doing wrong?

Answer 1

Here is the corrected prediction part of the code, with my comments:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

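The main problem is that the training set X1 and x_new have a different number of dummy columns. Below I add the missing dummy columns and fill them with zeros: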

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

Now x_new has the same number of columns as the training dataset X1:

               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...                            
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0

[3 rows x 11 columns]

Finally, use the previously trained model reg to predict on the new dataset x_new:

reg.predict(x_new)

Result:

Client Number
11     35.956284
12     35.956284
13    135.956284
dtype: float64
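
One detail worth checking in the table above: reindex with fill_value=0 also creates the const column filled with zeros, so the intercept does not contribute to these predictions. Below is a minimal sketch of one way to keep the intercept. It uses pandas only, and train_columns is a stand-in that assumes the training matrix X1 has const first, followed by the ten dummies; it is not taken from the fitted model itself.

```python
import pandas as pd

# Assumed column layout of the training matrix X1 (const first, then dummies).
train_columns = ['const', 'Lisbon', 'London', 'Madrid', 'New York', 'Tokyo',
                 'Bitcoin', 'Master Card', 'Visa', 'No', 'Yes']

new = pd.DataFrame({'City': ['Tokyo', 'Tokyo', 'Lisbon'],
                    'Card': ['Visa', 'Visa', 'Visa'],
                    'Colateral': ['Yes', 'Yes', 'No']},
                   index=pd.Index([11, 12, 13], name='Client Number'))

# dtype=float keeps the dummies numeric regardless of the pandas version.
x_new = pd.get_dummies(new, prefix='', prefix_sep='', dtype=float)

# Align to the training layout; absent dummies (and const) are filled with 0.
x_new = x_new.reindex(columns=train_columns, fill_value=0.0)

# Restore the intercept column so the constant term is included.
x_new['const'] = 1.0

print(x_new[['const', 'Tokyo', 'Lisbon', 'Visa']])
```

Passing an x_new built this way to reg.predict would add the fitted const coefficient to every row, so the numbers would differ from the ones shown above.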

Appendix

As requested, below is fully reproducible code for testing the training and prediction steps:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

reg.predict(x_new)
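
For reference, the original ValueError can be reproduced with plain NumPy. As far as I can tell, OLS.predict(params) multiplies the model's design matrix by the parameter vector, so the sketch below only mimics the shapes from the question: a design matrix built from the single Total column (3 rows, 1 column) and the 11 fitted coefficients. The zero-filled arrays are placeholders, not real data.

```python
import numpy as np

# Stand-ins with the same shapes as in the question:
# a design matrix built from the single 'Total' column -> 3 rows, 1 column;
# reg.params -> 11 coefficients (const + 10 dummies).
exog = np.zeros((3, 1))
params = np.zeros(11)

try:
    np.dot(exog, params)
    message = None
except ValueError as err:
    message = str(err)

print(message)

# With a (3, 11) design matrix the product is defined and yields 3 predictions.
aligned = np.dot(np.zeros((3, 11)), params)
print(aligned.shape)
```

The takeaway is that the new design matrix must have exactly one column per fitted coefficient, in the same order as during training.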

Answer 2

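The biggest problem is that you are not applying the same dummy transformation. That is, some category values do not occur in df1, so their columns are missing. You can add the missing columns with the following code, which is borrowed from another source: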

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
 'Card': ['Visa','Visa','Visa'],
 'Colateral':['Yes','Yes','No'],
 'Client Number':[11,12,13],
 'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape)  # Shape is 3x6 but it has to be 3x11
# Get the columns that exist in the training set but not in the test set
missing_cols = set( df.columns ) - set( df1.columns )
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Ensure the columns of the test set are in the same order as in the training set
df1 = df1[df.columns]
print(df1.shape)  # Shape is 3x11

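In addition, you mixed up x_new and y_new, and the new design matrix needs the constant column that reg was fit with. So it should be:

# reg was fit on X1 (const + 10 dummies), so add the constant here as well
x_new = sm.add_constant(df1.drop(columns='Total').astype(float), has_constant='add')
y_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

Note that I used df1.drop(columns='Total') rather than df1[['Lisbon','Tokyo','Visa','No','Yes']]: it is 1) less prone to typos and 2) less code.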
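
An alternative to adding the missing columns afterwards is to fix the category levels before encoding, so that get_dummies always emits the same columns. This is only a sketch: the levels dictionary assumes the complete set of values seen in training is known up front, and encode is a hypothetical helper name, not part of pandas or statsmodels.

```python
import pandas as pd

# Assumed to be the complete set of levels seen in the training data.
levels = {
    'City': ['Lisbon', 'London', 'Madrid', 'New York', 'Tokyo'],
    'Card': ['Bitcoin', 'Master Card', 'Visa'],
    'Colateral': ['No', 'Yes'],
}

def encode(frame):
    """One-hot encode with a fixed set of levels per column (hypothetical helper)."""
    frame = frame.copy()
    for col, cats in levels.items():
        # Unobserved categories still get their own (all-zero) dummy column.
        frame[col] = pd.Categorical(frame[col], categories=cats)
    return pd.get_dummies(frame, prefix='', prefix_sep='', dtype=float)

new = pd.DataFrame({'City': ['Tokyo', 'Tokyo', 'Lisbon'],
                    'Card': ['Visa', 'Visa', 'Visa'],
                    'Colateral': ['Yes', 'Yes', 'No']})

encoded = encode(new)
print(encoded.shape)
print(list(encoded.columns))
```

Running the same helper on the training frame and on new data gives identical column sets and ordering, which removes the need for a separate reordering step.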

Answer 3

First, you need to either string-index all the words or one-hot encode the values. ML models do not accept text, only numbers. Next, you want X and y to be:

X = d.iloc[:,:-1]
y = d.iloc[:,-1]

This way X has shape [11,3] and y has shape [11,], which are the required shapes.
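
A short check of those shapes on the question's data, followed by one possible encoding step. This is a sketch using pandas only; X_encoded is an illustrative name, and one-hot encoding via get_dummies is just one of the two options mentioned above.

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No',
                  'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

X = d.iloc[:, :-1]   # City, Card, Colateral (still strings)
y = d.iloc[:, -1]    # Total
print(X.shape, y.shape)

# X still holds text, so it has to be encoded before fitting.
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=float)
print(X_encoded.shape)
```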

3 answers

Answer 1: score 1, accepted, 2020-08-19 09:20:49
Answer 2: score 0, 2020-08-19 09:03:11
Answer 3: score 0, 2020-08-19 09:05:53
