I want to do a special fillna() on the following data set:
name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?
In this dataset, ? means any non-integer value, such as na or ???.
A spend value of ? in the A, B, and C rows has to be replaced with the mean of that group, i.e. ? should be replaced with np.mean(A), np.mean(B), or np.mean(C) respectively.
For C there are no other values, so it has to be 0.
We can't directly apply fillna(np.mean) in this case.
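To make the C case concrete, here is a minimal sketch. It assumes the ? placeholders have already been converted to NaN, and it rebuilds only the name and spend columns by hand. It shows that the mean of a group with no values is itself NaN, which is why a fallback to 0 is needed:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with "?" already represented as NaN (assumption).
df = pd.DataFrame({
    "name": list("AABBBCC"),
    "spend": [1012, np.nan, 1300, 2000, np.nan, np.nan, np.nan],
})

# Mean of each group; NaN values are skipped by default.
group_means = df.groupby("name")["spend"].mean()
print(group_means)
```

Groups A and B get a numeric mean, while group C, which has no values at all, gets NaN.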
Here's a solution:
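import numpy as np
import pandas as pd
df = df.replace("?", np.nan)
df["spend"] = pd.to_numeric(df["spend"])
df["received"] = pd.to_numeric(df["received"])
# Fill missing spend values with the mean of the row's group.
missing = df["spend"].isna()
df.loc[missing, "spend"] = df.groupby("name")["spend"].transform("mean")[missing]
# Groups with no values at all still have NaN here, so fall back to 0.
df["spend"] = df["spend"].fillna(0)
Result:
name spend received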
0 A 1012.0 1200.0
1 A 1012.0 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B 1650.0 NaN
5 C 0.0 NaN
6 C 0.0 NaN
Solution:
Use pd.read_csv(..., na_values='?') to convert the ? placeholders to NaN at read time.
The key line is then:
df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)
Code:
import pandas as pd
from io import StringIO
dat = """name,spend,received
A,1012,1200
A,?,1500
B,1300,?
B,2000,2500
B,?,?
C,?,?
C,?,?"""
df = pd.read_csv(StringIO(dat), na_values='?')
name spend received
0 A 1012.0 1200.0
1 A NaN 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B NaN NaN
5 C NaN NaN
6 C NaN NaN
df['spend'] = df.groupby('name')['spend'].transform(lambda s: s.fillna(s.mean())).fillna(0)
name spend received
0 A 1012.0 1200.0
1 A 1012.0 1500.0
2 B 1300.0 NaN
3 B 2000.0 2500.0
4 B 1650.0 NaN
5 C 0.0 NaN
6 C 0.0 NaN
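The same approach can be stretched to cover both columns and more than one placeholder string. The sketch below is an illustration rather than part of the answer above: the sample data is altered to include na and ??? (an assumption based on the question's description), and the names cols and group_means are introduced only for this example.

```python
from io import StringIO

import pandas as pd

# Variant of the question's data with extra placeholder strings (assumption).
dat = """name,spend,received
A,1012,1200
A,?,1500
B,1300,na
B,2000,2500
B,???,?
C,?,?
C,?,?"""

# na_values accepts a list, so several placeholders can be treated as missing.
df = pd.read_csv(StringIO(dat), na_values=["?", "???", "na"])

cols = ["spend", "received"]
# Per-group means, broadcast back to the original row order.
group_means = df.groupby("name")[cols].transform("mean")
# Fill from the group means first, then use 0 for groups with no values at all.
df[cols] = df[cols].fillna(group_means).fillna(0)
print(df)
```

Passing a DataFrame to fillna aligns on both index and column labels, so each column is filled from its own group means.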
Assuming the ? placeholders could also be arbitrary strings:
import pandas as pd
import numpy as np
idx = ['A'] * 3 + ['B'] * 3 + ['C'] * 3
data = np.random.random_sample((9,2))
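df = pd.DataFrame(index=idx, data=data, columns=['spend', 'recieved'])
df.index.name = 'name'
df.iloc[2, 1] = np.nan
# Cast to object first so a string can be stored in the otherwise float column.
df['spend'] = df['spend'].astype(object)
df.iloc[1, 0] = 'ABCD'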
df.iloc[4:6, 0] = np.nan
df
name spend recieved
A 0.197366 0.467532
A ABCD 0.256184
A 0.559562 NaN
B 0.59835 0.415382
B NaN 0.163827
B NaN 0.759888
C 0.897332 0.025344
C 0.782683 0.428465
C 0.201591 0.601339
Then
df = df.apply(pd.to_numeric, errors='coerce')
df['spend'] = df['spend'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))
df['recieved'] = df['recieved'].groupby(level=0).transform(lambda x: x.fillna(x.mean()).fillna(0))
Which yields:
name spend recieved
A 0.197366 0.467532
A 0.378464 0.256184
A 0.559562 0.361858
B 0.598350 0.415382
B 0.598350 0.163827
B 0.598350 0.759888
C 0.897332 0.025344
C 0.782683 0.428465
C 0.201591 0.601339
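The step that makes this work for arbitrary strings is pd.to_numeric with errors='coerce'. As a small standalone sketch (the sample values are made up for illustration), anything that cannot be parsed as a number becomes NaN, after which the group-wise fill applies as usual:

```python
import pandas as pd

# Mixed numeric strings and placeholder strings (illustrative values).
raw = pd.Series(["1012", "?", "na", "???", "2000"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
clean = pd.to_numeric(raw, errors="coerce")
print(clean.tolist())
```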