
Overfitting in linear regression

  1. I am fitting a linear regression model to the average percentage of flight seats booked. When I call predict for a date far in the future, the result is greater than 100%, which is impossible. How can I avoid this?
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('Flightseats.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Features: days elapsed since the first date, and day of week (Monday=0)
min_date = df['Date'].min()
df['days'] = (df['Date'] - min_date).dt.days
df['day_index'] = df['Date'].dt.dayofweek

model = LinearRegression()
model.fit(df[['days', 'day_index']], df['Percentage'])

# Predict far outside the training range: day 3000 with day_index 2 (Wednesday)
model.predict(pd.DataFrame({'days': [3000], 'day_index': [2]}))
  2. How should I normalize df['days'] for the train and test sets, or should I not normalize at all? I suspect the issue above is caused by the days term.
max_days = df['days'].max()

train['normalized_days'] = train['days']/max_days
test['normalized_days'] = test['days']/max_days

That's not overfitting; it's what linear functions do. A linear function's output is unbounded, so sufficiently large inputs produce arbitrarily large outputs. A linear model is justified here only as an approximation of the true behavior over a bounded interval.
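To make this concrete, here is a minimal sketch on synthetic data. The data is an assumption standing in for Flightseats.csv, with occupancy rising from about 60% by 0.02 points per day over one year. It shows a prediction inside the training range staying plausible while day 3000 goes past 100%, and it also checks that dividing days by a constant leaves the predictions unchanged, since ordinary least squares with an intercept just rescales the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for Flightseats.csv (an assumption, not the asker's data).
rng = np.random.default_rng(0)
days = np.arange(365)
day_index = days % 7
percentage = 60 + 0.02 * days + rng.normal(0, 1, size=days.size)

X = np.column_stack([days, day_index])
model = LinearRegression().fit(X, percentage)

# Inside the training range the line is a reasonable approximation.
in_range = model.predict(np.array([[200, 2]]))[0]

# Far outside it, the same slope keeps adding to the prediction without limit.
far_out = model.predict(np.array([[3000, 2]]))[0]

# Dividing days by a constant rescales the coefficient, not the predictions.
max_days = days.max()
X_scaled = np.column_stack([days / max_days, day_index])
scaled_model = LinearRegression().fit(X_scaled, percentage)
far_out_scaled = scaled_model.predict(np.array([[3000 / max_days, 2]]))[0]

print(in_range, far_out, far_out_scaled)
```

So normalizing the days column would not be expected to fix the out-of-range prediction; the unbounded output comes from the model form, not from the scale of the feature.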

There is no single "right" approach here. You need to explore the data you are trying to predict and choose an adequate model. If you want to predict arbitrarily far into the future, LinearRegression is not that model.
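One possible alternative, offered as a sketch and not as the recommended model, is to fit the linear regression on a logit-transformed target and map predictions back through the sigmoid, which keeps every output strictly between 0% and 100%. The synthetic data and the helper name predict_percentage below are assumptions made for illustration, and the percentages are clipped away from exactly 0 and 100 because the logit is undefined there.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for Flightseats.csv (an assumption, not the asker's data).
rng = np.random.default_rng(0)
days = np.arange(365)
day_index = days % 7
percentage = np.clip(60 + 0.02 * days + rng.normal(0, 1, size=days.size), 1, 99)

X = np.column_stack([days, day_index])

# Map percentages in (0, 100) onto the whole real line with the logit.
p = percentage / 100.0
z = np.log(p / (1 - p))

# The model is still linear, but on the logit scale.
model = LinearRegression().fit(X, z)

def predict_percentage(features):
    """Predict on the logit scale, then map back to (0, 100) with the sigmoid."""
    z_hat = model.predict(np.asarray(features, dtype=float))
    return 100.0 / (1.0 + np.exp(-z_hat))

far_out = predict_percentage([[3000, 2]])[0]
print(far_out)
```

This only guarantees that the output stays in range. Whether a forecast that far beyond the training data is meaningful still depends on how the real data behaves.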

