
Impute missing values with mean of column in machine learning

I know what imputing missing values means; I'm asking specifically about imputing them with the mean of the column. I usually impute missing values before I split the data into train and test sets, but then I saw a comment on this Q&A that said:

CAUTION: if you want to use this for Machine Learning / Data Science: from a Data Science perspective it is wrong to first replace NA and then split into train and test... You MUST first split into train and test, then replace NA by mean on train and then apply this stateful preprocessing model to test, see the answer involving sklearn below! – Fabian Werner Aug 28 '19 at 9:18

What does that mean? Can we do it, and how? Is there any difference between imputing before and after splitting the data? If so, why? Please help me understand, because I'm quite confused about this.

Yes, that statement is correct. You should first split the data into train and validation/test sets, then calculate the mean on the train data only, and use that same mean to fill the missing values in the validation/test data.
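The split-then-impute order can be sketched with scikit-learn, which the quoted comment refers to. This is only a minimal illustration: the toy arrays `X` and `y`, the 25% test size, and the `random_state` are made-up assumptions, not something from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing values (made-up numbers for illustration).
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
    [5.0, np.nan],
    [6.0, 60.0],
    [np.nan, 70.0],
    [8.0, 80.0],
])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# 1) Split first, while the missing values are still in place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) Learn the column means from the train rows only.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# 3) Reuse those train means on the test rows (transform only, no refit).
X_test_imputed = imputer.transform(X_test)

# The learned means are stored on the fitted imputer.
train_means = imputer.statistics_
```

The key point is that `fit` (or `fit_transform`) is called on the train split only, while the test split only ever sees `transform`.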

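In fact, this applies to any preprocessing step whose parameters are estimated from the data itself. If you calculate the statistic and transform on the whole dataset, you leak information from the validation/test rows into the training data, and the validation score can look better than it really is. We want a correct validation, so the validation/test set should go through exactly the same transformation as the train set, using the parameters that were fitted on the train set.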
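To make the leak concrete, here is a small dependency-free sketch. The numbers and the helper names `column_mean` and `fill_missing` are hypothetical, chosen only to exaggerate the effect: the fill value computed on the whole column is pulled toward a test value that the model should never have seen.

```python
import math
from statistics import mean


def column_mean(values):
    """Mean of the non-missing entries (NaN marks a missing value)."""
    return mean(v for v in values if not math.isnan(v))


def fill_missing(values, fill_value):
    """Return a copy of values with every NaN replaced by fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]


nan = float("nan")
train = [1.0, 2.0, nan, 3.0]   # hypothetical train column
test = [100.0, nan]            # hypothetical test column

# Leaky: the mean is computed on train + test, so the test value 100.0
# influences what gets written into the train rows.
leaky_fill = column_mean(train + test)
leaky_train = fill_missing(train, leaky_fill)

# Clean: the mean comes from train only and is then reused for test.
clean_fill = column_mean(train)
clean_train = fill_missing(train, clean_fill)
clean_test = fill_missing(test, clean_fill)

print(leaky_fill, clean_fill)
```

The same reasoning holds for scaling, encoding, or feature selection: whatever is estimated from data should be estimated from the train split and then applied unchanged to the rest.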
