
How to fill non-integer values with the mean of their group, and replace all-NaN groups with 0

I want to do a special fillna() on the following data set:

name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?
  • In this dataset, ? stands for any non-integer value, such as na or ???.

  • A ? in the spend column of an A, B, or C row has to be replaced with the mean spend of that group, i.e. np.mean over the A, B, or C values respectively.

  • For C there are no other values, so its ? entries have to become 0.

We can't directly apply fillna(np.mean) in this case.
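To make the problem concrete, here is a minimal sketch of what a plain fillna with the column mean does. It assumes the ? entries have already been turned into NaN, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The question's spend column, with "?" already converted to NaN.
df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "C", "C"],
    "spend": [1012, np.nan, 1300, 2000, np.nan, np.nan, np.nan],
})

# A plain fillna uses one overall mean, (1012 + 1300 + 2000) / 3,
# for every missing row, regardless of which group the row is in.
overall_fill = df["spend"].fillna(df["spend"].mean())
print(overall_fill)
```

Every missing row gets the same value here, whereas the question asks for 1012 in group A, 1650 in group B, and 0 in group C.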

Here's a solution:

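df = df.replace("?", np.nan)  # np.NaN no longer exists in NumPy 2.0
df["spend"] = pd.to_numeric(df["spend"])
df["received"] = pd.to_numeric(df["received"])
# Fill each missing spend with its group's mean (all-NaN groups stay NaN)
df["spend"] = df["spend"].fillna(df.groupby("name")["spend"].transform("mean"))
# Groups with no valid spend at all (C) fall back to 0
df["spend"] = df["spend"].fillna(0)

Result:

  name   spend  received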
0    A  1012.0    1200.0
1    A  1012.0    1500.0
2    B  1300.0       NaN
3    B  2000.0    2500.0
4    B  1650.0       NaN
5    C     0.0       NaN
6    C     0.0       NaN

Solution:

  1. Use pd.read_csv(..., na_values='?') to turn the ? entries into NaN at read time.
  2. Adapt the standard approach for replacing NaNs within a group with that group's mean.
  3. The twist is that an all-NaN group has a NaN mean, so the result needs a further fillna(0).
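Step 3 is easy to check in isolation. The following is a small sketch using a hand-built Series rather than the CSV; the names spend, group_means, and fallback are only for illustration.

```python
import numpy as np
import pandas as pd

spend = pd.Series(
    [1012, np.nan, 1300, 2000, np.nan, np.nan, np.nan],
    index=pd.Index(["A", "A", "B", "B", "B", "C", "C"], name="name"),
)

# mean() skips NaN by default, so A and B get a real mean;
# C has nothing left to average and comes back as NaN.
group_means = spend.groupby(level="name").mean()
print(group_means)

# A final fillna(0) is what turns the all-NaN group into 0.
fallback = group_means.fillna(0)
print(fallback)
```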

So the key line is:

df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)

Code:

import pandas as pd
from io import StringIO

dat = """name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?"""

df = pd.read_csv(StringIO(dat), na_values='?')

  name   spend  received
0    A  1012.0    1200.0
1    A     NaN    1500.0
2    B  1300.0       NaN
3    B  2000.0    2500.0
4    B     NaN       NaN
5    C     NaN       NaN
6    C     NaN       NaN

df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)

  name   spend  received
0    A  1012.0    1200.0
1    A  1012.0    1500.0
2    B  1300.0       NaN
3    B  2000.0    2500.0
4    B  1650.0       NaN
5    C     0.0       NaN
6    C     0.0       NaN
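The output above still has NaN in received, because only spend was filled. If both columns should get the same treatment, one possible sketch is to compute the group means for several columns at once and pass them to fillna. The cols list and the means variable are names introduced here, not part of the original answer.

```python
from io import StringIO

import pandas as pd

dat = """name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?"""

df = pd.read_csv(StringIO(dat), na_values="?")

cols = ["spend", "received"]
# transform("mean") returns a frame shaped like df[cols], holding each
# row's group mean (NaN for groups that contain no valid value).
means = df.groupby("name")[cols].transform("mean")
# Fill from the aligned group means first, then fall back to 0.
df[cols] = df[cols].fillna(means).fillna(0)
print(df)
```

With this data, the two missing received values in group B become 2500, the only valid value in that group, and both C rows become 0 in both columns.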

Assuming the ? placeholders could also be arbitrary strings:

import pandas as pd
import numpy as np

idx = ['A'] * 3 + ['B'] * 3 + ['C'] * 3
data = np.random.random_sample((9,2))

df = pd.DataFrame(index=idx, data=data, columns=['spend', 'received'])
df.index.name = 'name'

df.iloc[2, 1] = np.nan
df['spend'] = df['spend'].astype(object)  # let the column hold a string
df.iloc[1, 0] = 'ABCD'
df.iloc[4:6, 0] = np.nan

df

name    spend       received
A       0.197366    0.467532
A       ABCD        0.256184
A       0.559562    NaN
B       0.59835     0.415382
B       NaN         0.163827
B       NaN         0.759888
C       0.897332    0.025344
C       0.782683    0.428465
C       0.201591    0.601339

Then coerce every non-numeric string to NaN and fill each group:

df = df.apply(pd.to_numeric, errors='coerce')

df['spend'] = df['spend'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))
df['received'] = df['received'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))

Which yields:

name spend      received
A    0.197366   0.467532
A    0.378464   0.256184
A    0.559562   0.361858
B    0.598350   0.415382
B    0.598350   0.163827
B    0.598350   0.759888
C    0.897332   0.025344
C    0.782683   0.428465
C    0.201591   0.601339
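The random sample above has no all-NaN group, so it never exercises the fallback to 0. As a rough check, the same coerce-then-fill idea can be run on the question's data with mixed junk strings. fill_group_mean is a hypothetical helper written for this sketch, not a pandas function.

```python
import pandas as pd


def fill_group_mean(df, key, cols, default=0):
    """Hypothetical helper: coerce cols to numeric, fill NaN with the
    group mean, and use `default` for groups with no valid value."""
    out = df.copy()
    # Anything that is not a number ("na", "???", "?") becomes NaN.
    out[cols] = out[cols].apply(pd.to_numeric, errors="coerce")
    means = out.groupby(key)[cols].transform("mean")
    out[cols] = out[cols].fillna(means).fillna(default)
    return out


raw = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "C", "C"],
    "spend": ["1012", "na", "1300", "2000", "???", "?", "?"],
    "received": ["1200", "1500", "?", "2500", "?", "?", "?"],
})

clean = fill_group_mean(raw, "name", ["spend", "received"])
print(clean)
```

The helper works on a copy, so raw keeps its original strings and the coercion can be inspected separately if needed.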
