I have a pandas DataFrame. I'm trying to fill the NaNs in the Price column with the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this:
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example, because otherwise both Sections have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series that we can pass to fillna() using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
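For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the example frame by hand, and the helper name fill_with_group_mean is my own invention for illustration, not something pandas provides.

```python
import numpy as np
import pandas as pd

# Rebuild the modified example frame used in this answer.
df = pd.DataFrame({
    "Name": ["Joe", "Bob", "Nancy", "Grace", "Jen", "Paul"],
    "Sex": ["M", "M", "F", "F", "F", "M"],
    "Section": [1, 1, 2, 1, 2, 2],
    "Price": [2.0, np.nan, 5.0, 6.0, 10.0, np.nan],
})

def fill_with_group_mean(frame, group_col, value_col):
    """Hypothetical helper: return value_col with NaNs replaced by their group's mean."""
    # transform("mean") returns one value per row, aligned with frame's index.
    group_means = frame.groupby(group_col)[value_col].transform("mean")
    # fillna only touches the rows that are NaN; existing prices are kept.
    return frame[value_col].fillna(group_means)

filled = fill_with_group_mean(df, "Section", "Price")
print(filled.tolist())  # [2.0, 4.0, 5.0, 6.0, 10.0, 7.5]
```

The helper returns a new Series rather than modifying df, so you can assign it back with df["Price"] = filled if you want the in-place behaviour shown above.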
Surgical but slower pandas solution
This is a more surgical approach that may provide some perspective and possibly be useful.
Use groupby to calculate the mean for each Section:
means = df.groupby('Section').Price.mean()
Use isnull to identify the nulls, to use for boolean slicing:
nulls = df.Price.isnull()
Use map on the Section column, sliced to just those rows with a null Price:
fills = df.Section[nulls].map(means)
Use loc to fill in the spots in df only where the nulls are:
df.loc[nulls, 'Price'] = fills
All together:
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
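As a quick sanity check, the sketch below runs the step-by-step version on a copy of the data and compares it with the groupby/transform one-liner. The frame is rebuilt by hand with only the two relevant columns, and the variable names expected and result are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Section": [1, 1, 2, 1, 2, 2],
    "Price": [2.0, np.nan, 5.0, 6.0, 10.0, np.nan],
})

# Reference result from the groupby/transform one-liner.
expected = df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))

# Step-by-step version, applied to a copy so df itself is left untouched.
result = df.copy()
means = result.groupby("Section")["Price"].mean()   # one mean per Section
nulls = result["Price"].isnull()                    # boolean mask of missing prices
# Look up each missing row's Section in the means and write it back in place.
result.loc[nulls, "Price"] = result.loc[nulls, "Section"].map(means)

print(result["Price"].equals(expected))  # True
```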
By "corresponding level" I am assuming you mean rows with an equal Section value.
If so, you can solve this with:
for section_value in sorted(set(df.Section)):
    # Rows belonging to the current Section
    mask = df['Section'] == section_value
    # Fill that Section's missing prices with that Section's mean
    df.loc[mask, 'Price'] = df.loc[mask, 'Price'].fillna(df.loc[mask, 'Price'].mean())
hope it helps! peace