Unable to make predictions with an OLS model

Question

I'm building an OLS model but can't make any predictions.

Can you explain what I'm doing wrong?

Building the model:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

Then it shows: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What am I doing wrong?

Answer 1

Here is the fixed prediction part of the code, with my comments:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

The main problem is that the training dataset X1 and the new dataset x_new have a different number of dummy columns. Below I add the missing dummy columns and fill them with zeros:

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

Now x_new has the same number of columns as the training dataset X1:

               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...                            
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0

[3 rows x 11 columns]

Finally, predict on the new dataset x_new using the previously trained model reg:

reg.predict(x_new)

Result:

Client Number
11     35.956284
12     35.956284
13    135.956284
dtype: float64
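A note on the intercept: in the frame printed above, the const column is 0 in every row, because reindex fills every column that is missing from x_new (including const) with fill_value. As a result, reg.predict(x_new) leaves the intercept term out of the prediction. If the intercept should contribute, one option is to set that column to 1 after aligning. The snippet below is a minimal sketch of that step using pandas only; train_cols is an illustrative stand-in for X1.columns and is not part of the original answer.

```python
import pandas as pd

# Stand-in for X1.columns from the training step (assumed order: const first,
# then the dummy columns in the order used to build X).
train_cols = ['const', 'Lisbon', 'London', 'Madrid', 'New York', 'Tokyo',
              'Bitcoin', 'Master Card', 'Visa', 'No', 'Yes']

new = pd.DataFrame(
    {'City': ['Tokyo', 'Tokyo', 'Lisbon'],
     'Card': ['Visa', 'Visa', 'Visa'],
     'Colateral': ['Yes', 'Yes', 'No']},
    index=pd.Index([11, 12, 13], name='Client Number'),
)

# dtype=int keeps the dummies numeric rather than boolean.
x_new = pd.get_dummies(new, prefix='', prefix_sep='', dtype=int)

# Add the dummy columns that the new data does not contain, in training order.
x_new = x_new.reindex(columns=train_cols, fill_value=0)

# reindex filled const with 0 as well; restore it so the intercept is applied.
x_new['const'] = 1.0

print(x_new.shape)
```

With a fitted model, reg.predict(x_new) would then include the intercept. sm.add_constant(x_new, has_constant='add') applied to the dummy columns is another way to get the same column; the has_constant='add' argument matters here because the Visa column is constant in the new data, so the default setting would skip adding const.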

APPENDIX

As requested, I enclose below fully reproducible code to test both the training and prediction tasks:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

reg.predict(x_new)

Answer 2

The biggest issue is that you are not using the same dummy transformation. That is, some categories present in the training data are absent from df1, so it ends up with fewer dummy columns. You can add the missing columns with the following code (from here):

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
 'Card': ['Visa','Visa','Visa'],
 'Colateral':['Yes','Yes','No'],
 'Client Number':[11,12,13],
 'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape)  # Shape is 3x6, but it has to be 3x11
# Find the columns of the training set that are missing from the test set
missing_cols = set(df.columns) - set(df1.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Put the test set columns in the same order as the training set
df1 = df1[df.columns]
print(df1.shape)  # Shape is 3x11

Further, you mixed up x_new and y_new. So it should be:

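# reg.params has 11 entries (const + 10 dummies), so the intercept column must be added too;
# has_constant='add' is needed because the Visa column is already constant in this data
x_new = sm.add_constant(df1.drop(['Total'], axis=1), has_constant='add').values
y_new = df1['Total'].values
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)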

Note that I used df1.drop(['Total'], axis=1) instead of df1[['Lisbon','Tokyo','Visa','No','Yes']] because it is (1) less prone to typing errors and (2) less code.
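For reference, the ValueError in the question comes from a plain matrix-vector product: the prediction multiplies the design matrix by the parameter vector, so the number of columns must equal the number of parameters. In the question, df1['Total'] was passed as the design matrix, which gives shape (3, 1), while reg.params has 11 entries. The following is a minimal sketch that reproduces the mismatch with NumPy alone; the zero arrays are placeholders, not the real data.

```python
import numpy as np

params = np.zeros(11)        # placeholder for reg.params (const + 10 dummies)
wrong_exog = np.zeros((3, 1))  # shape produced by passing df1['Total'] as exog

try:
    np.dot(wrong_exog, params)
    message = None
except ValueError as err:
    # Same kind of "shapes ... not aligned" error as in the question
    message = str(err)

right_exog = np.zeros((3, 11))  # one column per parameter
predictions = np.dot(right_exog, params)

print(message)
print(predictions.shape)
```

Once the design matrix has 11 columns in the same order as the training columns, the product is defined and yields one prediction per row.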

Answer 3

First, you need to either string-index all the words or one-hot encode the values. ML models don't accept words, only numbers. Next, you want your X and y to be:

X = d.iloc[:,:-1]
y = d.iloc[:,-1]

This way X has a shape of [11,3] and y has a shape of [11,], which are the shapes needed.
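As a quick check of those shapes, the sketch below rebuilds the question's DataFrame, splits it as described, and then one-hot encodes X with pd.get_dummies. The X_encoded name and the dtype=int argument are illustrative choices, not part of the original answer.

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes',
                  'No', 'Yes'],
    'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

X = d.iloc[:, :-1]   # City, Card, Colateral (still strings)
y = d.iloc[:, -1]    # Total

# One-hot encode the string columns so the model receives numbers only.
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=int)

print(X.shape, y.shape, X_encoded.shape)  # (11, 3) (11,) (11, 10)
```

The three string columns expand to ten dummy columns (five cities, three cards, two collateral values), which matches the column list used to build X in the question.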

3 answers

Answer 1: 1, accepted, 2020-08-19 09:20:49
Answer 2: 0, 2020-08-19 09:03:11
Answer 3: 0, 2020-08-19 09:05:53
