Multiple Linear Regression with Groupby and single output of RMSE and MAPE scores
I have a dataset from 1115 stores and would like to report RMSE and MAPE for a linear regression that predicts Sales. The issue is that I need to fit a separate regression for each store (1115 regressions) but then report a single RMSE value and a single MAPE value across all of them.
Here is example code for one method I found:
Y_pred2 = np.zeros(test_val.shape[0])  # create an array filled with placeholder zeroes
train_bystore = train2.groupby(['Store'])
test_bystore = test_val2.groupby(['Store'])
for i in range(1, 1116):
    a = train_bystore.get_group(i)
    b = test_bystore.get_group(i)
    # create loop to instantiate datasets
    X_train = a.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    X_val = b.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    Y_train = a['Sales']
    Y_val = b['Sales']
    lr = LinearRegression()
    lr.fit(X_train, Y_train)
    # now to loop for predict
    pred = lr.predict(X_val)
    i = 0
    for j in b.index:
        Y_pred2[j] = pred[i]
        i += 1
print('RMSE %0.3f' % np.sqrt(mean_squared_error(Y_pred2, Y_val)))
print('MAPE %0.3f%%' % (mean_absolute_percentage_error(Y_pred2, Y_val) * 100))
The output was a ridiculous number (but the code ran and is apparently correct):
RMSE 2886004774448802.532
MAPE 345.733%
I tried copying this method, but it throws an error:
Error: Found input variables with inconsistent numbers of samples: [34565, 31]
Then I tried this alternative method on my own, which I prefer. However, it doesn't contain lines for the RMSE and MAPE output, because I am not sure how to produce only one score of each (across all 1115 regressions), as in the example above:
Y_train2 = train2['Sales']
Y_val2 = test_val2['Sales']
X_train2 = train2.drop(['Date', 'Sales', 'Customers'], axis=1)
X_val2 = test_val2.drop(['Date', 'Sales', 'Customers'], axis=1)
def model_grp(xtrain, xvals, ytrain, yvals):
    return np.squeeze(LinearRegression().fit(xtrain, ytrain).predict(xvals))
X_train2.groupby('Store').apply(model_grp, xtrain=X_train2, xvals=X_val2, ytrain=Y_train2, yvals=Y_val2)
I still got an error here too:
Error: model_grp() got multiple values for argument 'xtrain'
Help!
After going through the code again, I seem to have found the reason for the error I encountered earlier:
Error: Found input variables with inconsistent numbers of samples: [34565, 31]
Basically, Y_val2 = test_val2['Sales'] needs to be defined outside the for loop, so that it covers every row of the validation set. Inside the loop, the validation data comes from a single store's group, test_bystore.get_group(i), so a 'Sales' column taken there has only that store's rows: an array of length 31.
As a result, the RMSE and MAPE scores could not be calculated while the validation target was defined within the for loop: the calculation was comparing Y_pred2 of length 34565 with one store's 'Sales' of length 31.
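The length mismatch described above can be reproduced in isolation. The following is a minimal sketch on made-up arrays (the names y_pred_all and y_true_one_store and the sizes 12 and 4 are illustrative assumptions, not the real data): scikit-learn's metric functions refuse to compare arrays with different numbers of samples.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical stand-ins: predictions for every validation row,
# but targets for only one store's rows.
y_pred_all = np.zeros(12)
y_true_one_store = np.ones(4)

try:
    mean_squared_error(y_true_one_store, y_pred_all)
    message = None
except ValueError as err:
    # scikit-learn checks that both inputs have the same number of samples
    message = str(err)

print(message)
```

Defining the full validation target once, outside the loop, keeps both arrays the same length.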
The corrected code is as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

Y_pred2 = np.zeros(test_val2.shape[0])  # array of zeroes, later filled with 'pred' values in the for loop
Y_val2 = test_val2['Sales']  # define Y_val2 here: it covers all stores and stays the same for every regression
# groupby returns a lazy DataFrameGroupBy object; get_group(i) retrieves one store's rows
train_bystore = train2.groupby('Store')
test_bystore = test_val2.groupby('Store')
for i in range(1, 1116):
    df1 = train_bystore.get_group(i)
    df2 = test_bystore.get_group(i)
    Y_train2 = df1['Sales']
    # Y_val = df2['Sales']  # incorrect to define here: it would hold one store's rows only
    X_train2 = df1.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    X_val2 = df2.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    model = LinearRegression()
    pred = model.fit(X_train2, Y_train2).predict(X_val2)  # one 'pred' array per store, 1 to 1115
    # assumes test_val2 has a default 0..n-1 index, so index labels double as positions in Y_pred2
    for k, j in enumerate(df2.index):
        Y_pred2[j] = pred[k]  # place each 'pred' value into its row of Y_pred2
# scikit-learn metrics take (y_true, y_pred); the order matters for MAPE
Model_2_RMSE = np.sqrt(mean_squared_error(Y_val2, Y_pred2))
Model_2_MAPE = mean_absolute_percentage_error(Y_val2, Y_pred2) * 100
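On the second error from the question: groupby(...).apply passes each group to the function as its first positional argument, so a function whose first parameter is xtrain receives the group there, and passing xtrain= again as a keyword collides with it. Below is a hedged sketch of a groupby-based variant that avoids this and still yields one RMSE and one MAPE. It runs on synthetic data; the helper make_toy, the feature columns Promo and DayOfWeek, and the sizes are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error


def make_toy(n_stores, n_rows, seed):
    """Build a small synthetic frame with one linear Sales relation per store (illustrative only)."""
    rng = np.random.default_rng(seed)
    frames = []
    for store in range(1, n_stores + 1):
        promo = rng.integers(0, 2, n_rows)
        dow = rng.integers(1, 8, n_rows)
        sales = 1000 * store + 500 * promo + 20 * dow + rng.normal(0, 10, n_rows)
        frames.append(pd.DataFrame(
            {"Store": store, "Promo": promo, "DayOfWeek": dow, "Sales": sales}))
    return pd.concat(frames, ignore_index=True)


train = make_toy(n_stores=3, n_rows=50, seed=0)
val = make_toy(n_stores=3, n_rows=10, seed=1)
features = ["Promo", "DayOfWeek"]

# Fit one regression per store and keep each store's predictions
# labelled with the validation rows they belong to.
preds = []
for store, val_group in val.groupby("Store"):
    train_group = train[train["Store"] == store]
    model = LinearRegression().fit(train_group[features], train_group["Sales"])
    preds.append(pd.Series(model.predict(val_group[features]), index=val_group.index))

# Stitch the per-store predictions back into validation-row order.
y_pred = pd.concat(preds).reindex(val.index)

# One score of each across all stores; scikit-learn expects (y_true, y_pred).
rmse = np.sqrt(mean_squared_error(val["Sales"], y_pred))
mape = mean_absolute_percentage_error(val["Sales"], y_pred) * 100
print("RMSE %0.3f" % rmse)
print("MAPE %0.3f%%" % mape)
```

Because the predictions carry the validation index, this does not depend on the index being 0..n-1. Note that MAPE divides by the true values, so rows where Sales is zero can inflate the score considerably.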