I have a data set of patients and I want to handle its missing values. It contains both numerical and text columns. I want to fill the missing values per subject ID, not based on whole columns only. The data set looks like this:
subject_id  time  heart_rate  blood_pressure  urine_color
1           1.10  23          60              red
1           2     NaN         40              NaN
2           3     60          80              NaN
2           4     NaN         NaN             dark yellow
I want to replace missing text values with that patient's most frequent value, and missing numeric values with that patient's mean, so the result looks like this:
subject_id  time  heart_rate  blood_pressure  urine_color
1           1.10  23          60              red
1           2     23          40              red
2           3     60          80              dark yellow
2           4     60          80              dark yellow
Can anyone help with this? Every imputation method I have found uses the most frequent value in the column, or a statistic computed over the whole column.
Use GroupBy.transform with a custom function that returns the mean for numeric columns and the mode for categorical columns, then replace the missing values with DataFrame.fillna:
import numpy as np

# mean for numeric columns, first mode for everything else
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else x.mode().iat[0]
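One caveat with `x.mode().iat[0]`: if every value of a categorical column is missing within a group, `mode()` returns an empty Series, so positional access fails. The sketch below illustrates this on a small hand-made Series (the name `all_missing` is only for illustration, not part of the original data) and shows how a `next(iter(...), None)` fallback avoids the error.

```python
import numpy as np
import pandas as pd

# Hypothetical group in which the categorical column has no observed value.
all_missing = pd.Series([np.nan, np.nan], dtype=object)

modes = all_missing.mode()   # NaN is dropped by default, so nothing is left
print(len(modes))

# Positional access on the empty result raises IndexError.
try:
    modes.iat[0]
    raised = False
except IndexError:
    raised = True
print(raised)

# Falling back to None keeps the group's values missing instead of failing.
safe_value = next(iter(modes), None)
print(safe_value)
```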
Alternatively, if a group may contain only NaN values in a categorical column:
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
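# every column except the grouping key
cols = df.columns.difference(['subject_id'])
# fill only the missing cells with the per-subject mean / mode
df[cols] = df[cols].fillna(df.groupby('subject_id')[cols].transform(f))
print(df)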
   subject_id  time  heart_rate  blood_pressure  urine_color
0           1   1.1          23              60          red
1           1     2          23              40          red
2           2     3          60              80  dark yellow
3           2     4          60              80  dark yellow
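For reference, here is a self-contained sketch of the same idea that can be run end to end. It makes a few assumptions: the sample frame is rebuilt by hand from the question, with the positions of the missing cells inferred from the expected result; `fill_value` and `filled` are illustrative names; and the fill is done one column at a time with `groupby(...)[col].transform`, which is a small variation on the all-columns call above rather than the exact code of the answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; NaN / None mark the missing cells
# (their positions are inferred from the expected output).
df = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "time": [1.10, 2.0, 3.0, 4.0],
    "heart_rate": [23, np.nan, 60, np.nan],
    "blood_pressure": [60, 40, 80, np.nan],
    "urine_color": ["red", None, None, "dark yellow"],
})


def fill_value(col: pd.Series):
    """Per-group fill value: mean for numeric data, first mode otherwise."""
    if pd.api.types.is_numeric_dtype(col):
        return col.mean()
    # None when the group has no observed value at all
    return next(iter(col.mode()), None)


cols = df.columns.difference(["subject_id"])
filled = df.copy()
for col in cols:
    # transform broadcasts the scalar back to every row of the subject's group
    per_subject = df.groupby("subject_id")[col].transform(fill_value)
    # fillna aligns on the index and only touches the missing cells
    filled[col] = df[col].fillna(per_subject)

print(filled)
```

Working on a copy keeps the original frame intact, which makes it easy to compare which cells were imputed, for example with `df.isna() & filled.notna()`.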