Unable to make predictions with an OLS model

Question

I am building an OLS model but cannot make any predictions with it.

Can you explain what I am doing wrong?

Building the model:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

This raises: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What exactly am I doing wrong?

Answer 1

Here is the corrected prediction part of the code, with my comments:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

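The main problem is that the training set X1 and x_new have different numbers of dummy columns, so their shapes do not match. Below I add the missing dummy columns and fill them with zeros: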

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

Now x_new has the same columns, in the same order, as the training dataset X1:

               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...                            
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0

[3 rows x 11 columns]

Finally, use the previously trained model reg to predict on the new dataset x_new:

reg.predict(x_new)

Result:

Client Number
11     35.956284
12     35.956284
13    135.956284
dtype: float64
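
One caveat about the output above: reindex with fill_value=0 also fills the const column with 0, so the intercept does not contribute to these predictions. Below is a minimal sketch of one way to keep the intercept. The column list train_columns and the two toy rows are made up for illustration and only stand in for X1.columns and the real data.

```python
import pandas as pd

# Hypothetical training design-matrix columns, standing in for X1.columns
train_columns = ['const', 'Lisbon', 'London', 'Tokyo', 'Visa', 'No', 'Yes']

new_raw = pd.DataFrame({'City': ['Tokyo', 'Lisbon'],
                        'Card': ['Visa', 'Visa'],
                        'Colateral': ['Yes', 'No']})

# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
x_new = pd.get_dummies(new_raw, prefix='', prefix_sep='', dtype=int)

# Add dummy columns seen only in training, in the training order
x_new = x_new.reindex(columns=train_columns, fill_value=0)

# reindex filled 'const' with 0 as well; set it back to 1 so the intercept is used
x_new['const'] = 1.0

print(x_new)
```

With the objects from this answer, the same idea would be x_new['const'] = 1 after the reindex step; each prediction then shifts by the fitted const coefficient.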

Appendix

As requested, here is fully reproducible code for testing the training and prediction steps:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

# dtype=int keeps the dummies numeric; recent pandas versions default to bool
df = pd.get_dummies(d, prefix='', prefix_sep='', dtype=int)

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
# dtype=int keeps the dummies numeric; recent pandas versions default to bool
df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)
x_new = df1.drop(columns='Total')

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

reg.predict(x_new)

Answer 2

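The biggest problem is that you are not applying the same dummy transformation to both datasets. That is, some values that appear in the training data do not appear in df1, so their columns are missing. You can add the missing columns with the following code (from here):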

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
 'Card': ['Visa','Visa','Visa'],
 'Colateral':['Yes','Yes','No'],
 'Client Number':[11,12,13],
 'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape)  # Shape is 3x6 but it has to be 3x11
# Find columns present in the training set but missing from the test set
missing_cols = set(df.columns) - set(df1.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Put the test-set columns in the same order as the training set
df1 = df1[df.columns]
print(df1.shape)  # Shape is 3x11
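
An alternative sketch that avoids patching columns afterwards: declare the full set of categories up front, so get_dummies produces the same columns for any subset of rows. The category lists below are copied from the training data in the question; the helper name encode is made up for illustration.

```python
import pandas as pd

# All levels observed in the training data (taken from the question)
categories = {
    'City': ['Lisbon', 'London', 'Madrid', 'New York', 'Tokyo'],
    'Card': ['Bitcoin', 'Master Card', 'Visa'],
    'Colateral': ['No', 'Yes'],
}

def encode(frame):
    """One-hot encode with a fixed category set (illustrative helper)."""
    frame = frame.copy()
    for col, levels in categories.items():
        # Unobserved levels still get a column because they are declared here
        frame[col] = pd.Categorical(frame[col], categories=levels)
    return pd.get_dummies(frame, prefix='', prefix_sep='', dtype=int)

new_rows = pd.DataFrame({'City': ['Tokyo', 'Tokyo', 'Lisbon'],
                         'Card': ['Visa', 'Visa', 'Visa'],
                         'Colateral': ['Yes', 'Yes', 'No']},
                        index=pd.Index([11, 12, 13], name='Client Number'))

encoded = encode(new_rows)
print(encoded.shape)
```

Applying the same helper to both the training rows and the new rows gives matching columns in matching order, so no reindexing step is needed.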

In addition, you mixed up x_new and y_new. It should be:

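# reg.params has 11 entries (const + 10 dummies), so add the intercept column too;
# has_constant='add' forces it even though 'Visa' is constant in this small sample
x_new = sm.add_constant(df1.drop(['Total'], axis=1), has_constant='add').to_numpy(dtype=float)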
y_new = df1['Total'].values
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

Note that I use df1.drop(['Total'], axis=1) rather than df1[['Lisbon','Tokyo','Visa','No','Yes']]: it is more convenient because 1) it is less prone to (typing) errors and 2) it takes less code.
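
The ValueError from the question can be reproduced with plain NumPy, which may make the mix-up easier to see. This sketch assumes, as the error message suggests, that the prediction comes down to a dot product between the design matrix and the parameter vector; the arrays are zero placeholders, not the fitted values.

```python
import numpy as np

params = np.zeros(11)  # stand-in for reg.params: const + 10 dummies

# In the question the 'Total' column was passed as the regressors: shape (3, 1)
wrong_exog = np.zeros((3, 1))
try:
    np.dot(wrong_exog, params)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# A design matrix with one column per parameter works: (3, 11) . (11,) -> (3,)
right_exog = np.zeros((3, 11))
predictions = np.dot(right_exog, params)
print(predictions.shape)
```

The number of columns in the new design matrix has to equal the number of fitted parameters, which is why both the swapped variables and the missing dummy columns lead to a shape error.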

Answer 3

First, you need to either string-index all the words or one-hot encode the values. ML models do not accept text, only numbers. Next, you want X and y to be:

X = d.iloc[:,:-1]
y = d.iloc[:,-1]

This way X has shape [11,3] and y has shape [11,], which are the shapes required.
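
A short sketch of these two steps together, using the data from the question: split off the target by position, then one-hot encode the remaining text columns. The name X_encoded is introduced here for illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No',
                  'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Client Number': list(range(1, 12)),
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

X = d.iloc[:, :-1]  # City, Card, Colateral (still text at this point)
y = d.iloc[:, -1]   # Total

# One-hot encode the text columns so the model receives numbers only
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=int)

print(X.shape, y.shape, X_encoded.shape)
```

Note that X itself still holds strings after the split; it is the encoded frame that can be passed to a model.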
