
Target variable has outliers: machine learning regression

I am currently working on a regression problem where the target variable has close to 2,000 outliers against 54,000 non-outliers.

How should I deal with data where the target variable itself has outliers?

Things I have tried so far:

  1. Training on the entire training data, including outliers: the score is just OK.
  2. Removing the outliers from the training data altogether: the score is worse.
  3. Keeping 80% of the outliers in the training data: the score improves.

My suggestion: if you have outliers in the target variable, don't simply remove those rows from the data set. Instead, try to bring them within boundary limits.

You can determine the upper and lower boundaries by plotting a box plot:

import seaborn as sns
# Points drawn beyond the whiskers are the candidate outliers
sns.boxplot(x=dataset['target variable'])
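Reading the boundaries off a plot is approximate, so it can help to compute them as numbers. The sketch below uses the common 1.5 × IQR rule, which is also the default whisker rule in seaborn's box plot. The helper name `iqr_bounds`, the multiplier `k`, and the toy data are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy data (hypothetical): one obviously extreme target value
dataset = pd.DataFrame(
    {"target variable": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0]}
)

lower_bound, upper_bound = iqr_bounds(dataset["target variable"])
print(lower_bound, upper_bound)
```

Whether 1.5 is the right multiplier depends on the data; a heavily skewed target may call for a wider fence or for percentile-based limits instead.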

You can also count the number of occurrences of each value in the target variable using

dataset['target variable'].value_counts()

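Then cap the values that fall outside the bounds using the following code, where upper_limit and lower_limit are the replacement values (typically the bounds themselves):

# Replace values above the upper bound with the upper limit
dataset.loc[dataset['target variable'] > upper_bound, 'target variable'] = upper_limit
# Replace values below the lower bound with the lower limit
dataset.loc[dataset['target variable'] < lower_bound, 'target variable'] = lower_limit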
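When the replacement values are the bounds themselves, the same capping can be written in one step with pandas' `Series.clip`. This is a minimal sketch on toy data; the bound values `8.5` and `14.5` and the new column name `target capped` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one obviously extreme target value
dataset = pd.DataFrame(
    {"target variable": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0]}
)

# Assumed bounds, e.g. read off the box plot
lower_bound, upper_bound = 8.5, 14.5

# clip() leaves in-range values untouched and pulls the rest to the nearest bound
capped = dataset["target variable"].clip(lower=lower_bound, upper=upper_bound)

# Count how many rows were changed, as a sanity check before training
n_capped = int((capped != dataset["target variable"]).sum())

# Store the result in a new column so the raw target stays available
dataset["target capped"] = capped
print(n_capped, capped.max())
```

Writing to a separate column keeps the original target intact, which makes it easy to compare models trained on capped and uncapped targets.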
