
Impute missing values with mean of column in machine learning

Original 2020-02-23 14:58:00 1 1 python/ machine-learning/ data-science

I know what imputing missing values means; I'm asking specifically about imputing with the mean of the column. I usually impute missing values before I split the data into train and test sets, but then I saw a comment on another Q&A that said:

CAUTION: if you want to use this for Machine Learning / Data Science: from a Data Science perspective it is wrong to first replace NA and then split into train and test... You MUST first split into train and test, then replace NA by mean on train and then apply this stateful preprocessing model to test, see the answer involving sklearn below! – Fabian Werner Aug 28 '19 at 9:18

What does this mean? Can we do it, and how? Is there any difference between imputing before and after splitting the data? If so, why? Please help me understand, because I'm quite confused about this.

1 answer

Yes, that statement is correct. You should first split the data into train and validation/test sets, then calculate the mean on the train data only and use that same value to fill the missing entries in the validation/test data.
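As a rough illustration of that order of operations, here is a minimal pandas sketch. The DataFrame, the column names `age` and `target`, and the 75/25 split are all made up for the example; they are assumptions, not something from the question.

```python
import numpy as np
import pandas as pd

# Toy data invented for this example; "age" has two missing values.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0, np.nan, 50.0, 60.0, 70.0],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Split first (75% train, 25% test), before touching the missing values.
train = df.sample(frac=0.75, random_state=0).copy()
test = df.drop(train.index).copy()

# 2. Compute the mean from the training rows only.
train_mean = train["age"].mean()

# 3. Fill missing values in BOTH sets with the training mean.
train["age"] = train["age"].fillna(train_mean)
test["age"] = test["age"].fillna(train_mean)
```

The test rows never contribute to `train_mean`, so the value used to fill them is something the model could also have known at training time.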

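In fact, this applies to any preprocessing step whose parameters are computed from the data itself. If you compute the statistics and transform on the whole dataset, information from the validation/test rows leaks into training. We want the validation to be honest, so the validation/test set should go through exactly the same transformation as the train set, using parameters learned from the train set alone.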
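The same idea can be sketched with scikit-learn, where each preprocessing step is fitted on the train split and then only applied to the test split. The arrays below are invented, and combining `SimpleImputer` with `StandardScaler` is just one possible example of data-dependent preprocessing, not the only way to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with a few missing entries.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [4.0, 40.0],
              [5.0, 50.0], [np.nan, 60.0], [7.0, 70.0], [8.0, 80.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Split before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Mean imputation followed by scaling; both steps learn parameters from data.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

X_train_prep = preprocess.fit_transform(X_train)  # learn means/scales on train
X_test_prep = preprocess.transform(X_test)        # reuse them, no refitting

# Column means the imputer learned, taken from the training rows only.
train_means = preprocess.named_steps["simpleimputer"].statistics_
```

Calling `fit_transform` on the train split and plain `transform` on the test split is what keeps the test data out of the learned statistics.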

Related Question

Is there a way to impute missing values in machine learning?

Impute NaNs with the mean in column and find percentage of missing values

How to Impute Missing Values When Running Machine Learning Binary Classification Using Multiple Text Input Features

Missing values in scikits machine learning

How to impute entire missing values in pandas dataframe with mode/mean?

Impute categorical missing values in scikit-learn using specific column

Predict NA (missing values) with machine learning

How to impute missing values with KNN

Impute missing values for testing set

Impute mean of single column in dask-ml


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 ACCEPTED 2020-02-23 15:41:50
