
Python pandas: replace NA with the median or mean of a group in a DataFrame

Suppose we have a df:

    A       B
    apple   1.0
    apple   2.0
    apple   NA
    orange  NA
    orange  7.0
    melon   14.0
    melon   NA
    melon   15.0
    melon   16.0

To replace the NA values we can use df["B"].fillna(df["B"].median()), but that fills every NA with the median of all the data in "B".

Is there a way to replace each NA with the median of its own group in "A" instead, like below?

    A       B
    apple   1.0
    apple   2.0
    apple   **1.5**
    orange  **7.0**
    orange  7.0
    melon   14.0
    melon   **15.0**
    melon   15.0
    melon   16.0

Thanks!

In pandas you can use groupby with transform to get each row's group median as a Series aligned with the original index, then pass that Series to fillna:

>>> med = df.groupby('A')['B'].transform('median')
>>> df['B'].fillna(med)
0     1.0
1     2.0
2     1.5
3     7.0
4     7.0
5    14.0
6    15.0
7    15.0
8    16.0
Name: B, dtype: float64
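
For completeness, here is a minimal self-contained sketch of the same approach. It rebuilds the frame from the question with np.nan standing in for NA, and writes the results to two new columns, B_median and B_mean (names chosen here for illustration only), so the original column stays untouched. Note that fillna returns a new Series rather than modifying df in place, so the result has to be assigned.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; np.nan stands in for NA.
df = pd.DataFrame({
    "A": ["apple", "apple", "apple", "orange", "orange",
          "melon", "melon", "melon", "melon"],
    "B": [1.0, 2.0, np.nan, np.nan, 7.0, 14.0, np.nan, 15.0, 16.0],
})

# transform returns one value per original row (the group's median,
# computed with NaN skipped), so it lines up with df["B"] by index.
group_median = df.groupby("A")["B"].transform("median")

# fillna returns a new Series; assign it to keep the result.
# "B_median" is an illustrative column name, not part of the question.
df["B_median"] = df["B"].fillna(group_median)

# Same pattern with the group mean instead of the median.
group_mean = df.groupby("A")["B"].transform("mean")
df["B_mean"] = df["B"].fillna(group_mean)

print(df)
```

To overwrite the original column instead, assign back with df["B"] = df["B"].fillna(group_median). If a group contains only NA values, its median is itself NaN, so those rows stay unfilled.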

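In R, you can use na.aggregate from zoo together with data.table to replace each NA with the mean of its group. Convert the 'data.frame' to a 'data.table' with setDT(df), group by 'A', and apply na.aggregate to 'B'. na.aggregate uses the mean by default; pass FUN = median to use the group median instead (both give the same result for this data).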

library(zoo)
library(data.table)
setDT(df)[,  B:= na.aggregate(B), A]
df
#      A    B
#1:  apple  1.0
#2:  apple  2.0
#3:  apple  1.5
#4: orange  7.0
#5: orange  7.0
#6:  melon 14.0
#7:  melon 15.0
#8:  melon 15.0
#9:  melon 16.0

