
Fill NaN values in one column based on other columns

I am working on a dataset of average age of marriage and am currently cleaning the data. The location column contains NaN values that I need to fill, but the column also has many unique values. I would like suggestions on how to fill NaN values in a column that has many unique values.


I have attached the dataset for reference, DataSet

I suggest doing it in 3 steps:

  1. Fill in the missing values of "location" with either the most common location or a separate value, "Unknown";
  2. Fill in the missing values of "age_of_marriage" with the median of this feature within each location;
  3. If any missing values of "age_of_marriage" remain (for example, in a location with no recorded ages), fill them in with the overall mean.
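
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/atharva07/Age-of-marriage/main/age_of_marriage_data.csv', sep=',')
# Step 1: label missing locations with a separate value
df['location'] = df['location'].fillna('Unknown')
# Step 2: fill by location; transform keeps the original index, so the result assigns back cleanly
df['age_of_marriage'] = df.groupby('location')['age_of_marriage'].transform(lambda x: x.fillna(x.median()))
# Step 3: fill whatever is still missing with the overall mean
df['age_of_marriage'] = df['age_of_marriage'].fillna(df['age_of_marriage'].mean())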
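
The other option in step 1, filling with the most common location, could be sketched roughly as follows. The small `toy` DataFrame and its values are made up for illustration and only stand in for the real dataset; the column names are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for the real dataset.
toy = pd.DataFrame({
    "location": ["London", "London", None, "Paris", "Paris", "London"],
    "age_of_marriage": [28.0, np.nan, 30.0, np.nan, np.nan, 32.0],
})

# Step 1 (alternative): fill missing locations with the most frequent one.
# mode() ignores NaN and may return several values on a tie; take the first.
most_common = toy["location"].mode().iloc[0]
toy["location"] = toy["location"].fillna(most_common)

# Step 2: fill missing ages with the median of their location.
toy["age_of_marriage"] = toy.groupby("location")["age_of_marriage"].transform(
    lambda s: s.fillna(s.median())
)

# Step 3: locations with no observed ages are still NaN; use the overall mean.
toy["age_of_marriage"] = toy["age_of_marriage"].fillna(toy["age_of_marriage"].mean())

print(toy)
```

Here the "Paris" rows have no observed ages at all, so step 2 leaves them as NaN and they are only filled in step 3. Note that filling with the most common location inflates that location's count, so "Unknown" may be the safer choice when many rows are missing.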

