Multiple linear regression using groupby with a single RMSE and MAPE score as output

Question

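I have a dataset from 1115 stores and want to report RMSE and MAPE by running linear regressions to predict Sales. The problem is that I need to run a separate regression for each store (1115 regressions) and then report a single RMSE value and a single MAPE value across all of them.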

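For example, all rows with Store = 1 go into one regression, then all rows with Store = 2 go into another, and so on.
Each store has many rows of data (e.g., store 1 has 900 rows representing daily sales for 900 recorded days, store 2 has another 900 rows, and so on).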

Here is sample code for one approach I found:

Y_pred2 = np.zeros(test_val.shape[0]) #create an array filled with placeholder zeroes

train_bystore = train2.groupby(['Store'])
test_bystore = test_val2.groupby(['Store'])

for i in range(1,1116):
    a = train_bystore.get_group(i)
    b = test_bystore.get_group(i)
    # create loop to instantiate datasets
    X_train = a.drop(['Store','Date','Sales','Customers'],axis=1).values
    X_val = b.drop(['Store','Date','Sales','Customers'],axis=1).values
    Y_train = a['Sales']
    Y_val = b['Sales'] 
    lr = LinearRegression()
    lr.fit(X_train,Y_train)
    # now to loop for predict
    pred = lr.predict(X_val)
    i=0
    for j in b.index:
        Y_pred2[j]=pred[i]
        i+=1
        
print('RMSE %0.3f' %np.sqrt(mean_squared_error(Y_pred2,Y_val)))
print('MAPE %0.3f%%' %(mean_absolute_percentage_error(Y_pred2,Y_val)*100))

The output is an absurd number (but the code runs and is apparently correct):

RMSE 2886004774448802.532
MAPE 345.733%

I tried to replicate this approach, but it threw an error:

Error: Found input variables with inconsistent numbers of samples: [34565, 31]

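I then tried the following alternative myself, which I prefer. However, it has no lines that output RMSE and MAPE, because I don't know how to set it up so that each metric yields a single score (across all 1115 regressions), as in the example above: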

Y_train2 = train2['Sales']
Y_val2 = test_val2['Sales']
X_train2 = train2.drop(['Date','Sales','Customers'],axis=1)
X_val2 = test_val2.drop(['Date','Sales','Customers'],axis=1)

def model_grp(xtrain, xvals, ytrain, yvals):
    return np.squeeze(LinearRegression().fit(xtrain, ytrain).predict(xvals))

X_train2.groupby('Store').apply(model_grp, xtrain= X_train2, xvals= X_val2, ytrain=Y_train2, yvals=Y_val2)

I get an error here as well:

Error: model_grp() got multiple values for argument 'xtrain'

Help!

Answer 1

After going through the code again, I seem to have found the cause of the error I ran into earlier:

Error: Found input variables with inconsistent numbers of samples: [34565, 31]

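Basically, Y_val2 = test_val2['Sales'] needs to be defined outside the for loop, on the full test_val2 frame, rather than on the per-store groups produced by the line test_bystore = test_val2.groupby(['Store']).

Inside the loop, the 'Sales' column of a single store's group (test_bystore.get_group(i)['Sales']) is an array of length 31, because it covers only that one store's rows.

As a result, Y_pred2 (length 34565, one entry per row of test_val2) and the validation target (length 31) have inconsistent lengths if I keep incorrectly defining the target inside the for loop, which is exactly what the error message reports.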
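The length mismatch can be reproduced on a small scale. The following is a minimal sketch on hypothetical toy data (test_toy and the variable names are made up for illustration, not taken from the original dataset): the prediction array has one slot per test row, while a target taken from a single store's group covers only that store.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: 3 stores with 4 test rows each (12 rows in total)
test_toy = pd.DataFrame({
    "Store": np.repeat([1, 2, 3], 4),
    "Sales": np.arange(12, dtype=float) + 1.0,
})

# One slot per test row, playing the role of Y_pred2
y_pred_all = np.zeros(len(test_toy))

# Target taken from one store's group: only 4 of the 12 rows
y_val_one_store = test_toy.groupby("Store").get_group(3)["Sales"]

# Target taken from the full frame: all 12 rows
y_val_all = test_toy["Sales"]

try:
    mean_squared_error(y_val_one_store, y_pred_all)
    mismatch_message = None
except ValueError as exc:
    # scikit-learn rejects inputs whose lengths differ
    mismatch_message = str(exc)

# With the full-length target the metric can be computed
rmse_all = np.sqrt(mean_squared_error(y_val_all, y_pred_all))
print(mismatch_message)
print(rmse_all)
```

Defining the target once on the full frame keeps it aligned with the full-length prediction array, whichever store the loop last visited.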

The corrected code is as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Array of zeroes, later filled with each store's 'pred' values inside the loop.
# Assumes test_val2 has the default 0..n-1 index, because row labels are used as positions below.
Y_pred2 = np.zeros(test_val2.shape[0])
Y_val2 = test_val2['Sales']  # defined once here: it stays the same for every looped regression

# groupby returns a lazy DataFrameGroupBy object; get_group(key) extracts one store's rows
train_bystore = train2.groupby('Store')
test_bystore = test_val2.groupby('Store')

for store in range(1, 1116):
    df1 = train_bystore.get_group(store)
    df2 = test_bystore.get_group(store)
    Y_train2 = df1['Sales']
    # Y_val = df2['Sales']  # incorrect to define here: it would hold only this store's rows
    X_train2 = df1.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    X_val2 = df2.drop(['Store', 'Date', 'Sales', 'Customers'], axis=1).values
    model = LinearRegression()
    pred = model.fit(X_train2, Y_train2).predict(X_val2)  # one 'pred' array per store, 1-1115
    Y_pred2[df2.index] = pred  # place this store's predictions into the matching rows of Y_pred2

# scikit-learn metrics take (y_true, y_pred); the order does not matter for RMSE but does for MAPE
Model_2_RMSE = np.sqrt(mean_squared_error(Y_val2, Y_pred2))
Model_2_MAPE = mean_absolute_percentage_error(Y_val2, Y_pred2) * 100
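
As for the second error in the question (model_grp() got multiple values for argument 'xtrain'): groupby(...).apply(func, **kwargs) passes each group to func as the first positional argument, so the group is already bound to xtrain when xtrain= arrives again as a keyword. Passing the full X_train2 would also fit every model on all stores rather than one. Below is a minimal sketch of a loop-free-indexing variant, assuming hypothetical toy data (train_toy, test_toy, and the Promo and DayOfWeek feature columns are made up for illustration): it iterates over the test groups, fits on the matching training group, and concatenates the per-store predictions before scoring once.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Hypothetical toy data standing in for train2 / test_val2: 3 stores, 2 features
rng = np.random.default_rng(0)

def make_frame(n_per_store):
    """Build a small frame with a per-store sales level plus noise."""
    store = np.repeat([1, 2, 3], n_per_store)
    promo = rng.integers(0, 2, size=store.size)
    day = rng.integers(1, 8, size=store.size)
    sales = 100.0 * store + 20.0 * promo + 3.0 * day + rng.normal(0, 1, store.size)
    return pd.DataFrame({"Store": store, "Promo": promo,
                         "DayOfWeek": day, "Sales": sales})

train_toy = make_frame(50)
test_toy = make_frame(10)
features = ["Promo", "DayOfWeek"]

train_by_store = train_toy.groupby("Store")
pieces = []
for store_id, test_group in test_toy.groupby("Store"):
    # Fit on this store's training rows only
    train_group = train_by_store.get_group(store_id)
    model = LinearRegression().fit(train_group[features], train_group["Sales"])
    # Keep the test row labels so predictions can be realigned later
    pieces.append(pd.Series(model.predict(test_group[features]),
                            index=test_group.index))

# Stitch the per-store predictions together in the original row order
y_pred = pd.concat(pieces).reindex(test_toy.index)
y_true = test_toy["Sales"]

# One score of each kind across all stores; metrics take (y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred) * 100
print(f"RMSE {rmse:.3f}")
print(f"MAPE {mape:.3f}%")
```

Carrying the row labels through a Series and reindexing at the end avoids relying on the test frame having a 0..n-1 index, which the positional fill into a zero array does rely on.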

Answer 1
0 2022-01-14 11:10:09
