
Multiple Linear Regression with Groupby and single output of RMSE and MAPE scores

The train2 and test_val2 datasets

I have a dataset covering 1115 stores and want to report RMSE and MAPE for a linear regression that predicts Sales. The catch is that I need to fit a separate regression for each store (1115 regressions) and then report a single RMSE and a single MAPE across all of them.

  • For example, all rows with Store = 1 are fitted in one regression, all rows with Store = 2 in another, and so on.
  • Each store has many rows of data (e.g. Store 1 has 900 rows, representing daily sales over 900 recorded days; Store 2 has another 900 rows, etc.)

Here is example code for one method I found:

Y_pred2 = np.zeros(test_val.shape[0]) #create an array filled with placeholder zeroes

train_bystore = train2.groupby(['Store'])
test_bystore = test_val2.groupby(['Store'])

for i in range(1,1116):
    a = train_bystore.get_group(i)
    b = test_bystore.get_group(i)
    # create loop to instantiate datasets
    X_train = a.drop(['Store','Date','Sales','Customers'],axis=1).values
    X_val = b.drop(['Store','Date','Sales','Customers'],axis=1).values
    Y_train = a['Sales']
    Y_val = b['Sales'] 
    lr = LinearRegression()
    lr.fit(X_train,Y_train)
    # now to loop for predict
    pred = lr.predict(X_val)
    i=0
    for j in b.index:
        Y_pred2[j]=pred[i]
        i+=1
        
print('RMSE %0.3f' %np.sqrt(mean_squared_error(Y_pred2,Y_val)))
print('MAPE %0.3f%%' %(mean_absolute_percentage_error(Y_pred2,Y_val)*100))

The output was some ridiculous number (but it worked and is apparently correct):

RMSE 2886004774448802.532
MAPE 345.733%

I tried copying this method, but it throws an error:

Error: Found input variables with inconsistent numbers of samples: [34565, 31]

Then I tried the alternative below on my own, which I prefer. It does not yet compute RMSE and MAPE, because I am not sure how to get a single score of each across all 1115 regressions, as the example above does:

Y_train2 = train2['Sales']
Y_val2 = test_val2['Sales']
X_train2 = train2.drop(['Date','Sales','Customers'],axis=1)
X_val2 = test_val2.drop(['Date','Sales','Customers'],axis=1)

def model_grp(xtrain, xvals, ytrain, yvals):
    return np.squeeze(LinearRegression().fit(xtrain, ytrain).predict(xvals))

X_train2.groupby('Store').apply(model_grp, xtrain= X_train2, xvals= X_val2, ytrain=Y_train2, yvals=Y_val2)

I still got an error here too:

Error: model_grp() got multiple values for argument 'xtrain'

Help!

After going through the code again, I found the reason for the error I encountered earlier:

Error: Found input variables with inconsistent numbers of samples: [34565, 31]

Basically, Y_val2 = test_val2['Sales'] needs to be assigned once, outside the for loop, from the full validation set. Inside the loop, the only validation data available is a single store's group taken from test_bystore = test_val2.groupby(['Store']).

Assigning the validation target from that group inside the loop overwrites it on every iteration, so after the last iteration it holds only the final store's rows: an array of length 31.

As a result, the RMSE and MAPE scores could not be calculated: the metric functions were comparing Y_pred2 (length 34565, one entry per validation row) with a target of length 31.
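The mismatch can be reproduced on a toy example. The sketch below uses a made-up 10-row validation frame (the data and the names `lengths` and `error_message` are illustrative only, not from the original dataset) and reassigns the target inside the loop the same way the copied code does:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Hypothetical miniature validation set: 3 stores with 4, 4 and 2 rows.
test_val = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Sales": [10.0, 12.0, 11.0, 13.0, 20.0, 22.0, 21.0, 23.0, 5.0, 6.0],
})

Y_pred = np.zeros(test_val.shape[0])  # one slot per validation row (10)

for store, group in test_val.groupby("Store"):
    Y_val = group["Sales"]  # overwritten on every pass; ends up as store 3 only

# After the loop, Y_val holds just the last store's rows.
lengths = (len(Y_pred), len(Y_val))

try:
    mean_squared_error(Y_val, Y_pred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(lengths)
print(error_message)
```

Here the prediction array has one entry per validation row while the target has only the last group's rows, which is the same situation as 34565 versus 31 in the real data.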

The corrected code is as follows:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# One prediction slot per validation row, keyed by test_val2's own index
Y_pred2 = pd.Series(0.0, index=test_val2.index)
# True values for the whole validation set; assigned once, outside the loop
Y_val2 = test_val2['Sales']

# groupby returns a lazy GroupBy object; get_group(store) pulls out one store's rows
train_bystore = train2.groupby('Store')
test_bystore = test_val2.groupby('Store')

for store in range(1, 1116):
    df1 = train_bystore.get_group(store)
    df2 = test_bystore.get_group(store)
    Y_train2 = df1['Sales']
    # Do not reassign the validation target here: df2['Sales'] covers one store only
    X_train2 = df1.drop(['Store','Date','Sales','Customers'], axis=1).values
    X_val2 = df2.drop(['Store','Date','Sales','Customers'], axis=1).values
    model = LinearRegression()
    pred = model.fit(X_train2, Y_train2).predict(X_val2)  # this store's predictions
    Y_pred2.loc[df2.index] = pred  # write them into this store's rows of Y_pred2

# scikit-learn metrics take (y_true, y_pred); the order matters for MAPE
Model_2_RMSE = np.sqrt(mean_squared_error(Y_val2, Y_pred2))
Model_2_MAPE = mean_absolute_percentage_error(Y_val2, Y_pred2) * 100
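As for the second error in the question, `model_grp() got multiple values for argument 'xtrain'`: `groupby(...).apply(func, ...)` passes each group to `func` as the first positional argument, so the group was already bound to `xtrain` when the keyword `xtrain=` arrived. One way to keep the grouped style without that collision is a small helper that iterates over the validation groups and looks up the matching training group. The following is only a sketch under stated assumptions: `predict_by_store`, `DROP_COLS`, and the synthetic `train_demo` / `val_demo` frames (with made-up `Promo` and `DayOfWeek` features) are hypothetical names introduced for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

DROP_COLS = ["Store", "Date", "Sales", "Customers"]  # same columns the loop drops


def predict_by_store(train, val, drop_cols=DROP_COLS):
    """Fit one LinearRegression per store; return predictions indexed like val."""
    train_groups = train.groupby("Store")
    parts = []
    for store, val_grp in val.groupby("Store"):
        train_grp = train_groups.get_group(store)
        model = LinearRegression().fit(
            train_grp.drop(columns=drop_cols), train_grp["Sales"]
        )
        pred = model.predict(val_grp.drop(columns=drop_cols))
        # Keep the validation rows' index so predictions align with the targets
        parts.append(pd.Series(pred, index=val_grp.index))
    return pd.concat(parts).reindex(val.index)


# Synthetic stand-in data: 3 stores, 40 days each, Sales exactly linear per store.
rows = []
for store in (1, 2, 3):
    for day in range(40):
        promo = day % 2
        dow = day % 7
        rows.append({
            "Store": store, "Date": day, "Promo": promo, "DayOfWeek": dow,
            "Customers": 100 + day,
            "Sales": 50.0 * store + 10.0 * store * promo + dow,
        })
data = pd.DataFrame(rows)
train_demo = data[data["Date"] < 30]
val_demo = data[data["Date"] >= 30]  # index is not 0..n-1 on purpose

y_true = val_demo["Sales"]
y_pred = predict_by_store(train_demo, val_demo)

# One pooled score of each kind across all per-store regressions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred) * 100
print(f"RMSE {rmse:.3f}")
print(f"MAPE {mape:.3f}%")
```

Because the predictions carry the validation rows' index, the pooled scores do not depend on the validation frame having a default 0..n-1 index, which the position-based fill in the loop relies on.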

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this post, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 