My DataFrame looks like this:
id zip location
X2 65123 Houston
T5 65123 Houston
A1 nan Houston
M8 89517 Berkley
X3 89518 Berkley
N2 nan Berkley
M9 nan nan
For some rows I don't have a zip code in 'zip', but I do have an entry in 'location'.
I'd like to fill the NaN values in 'zip' with one of the zip codes from the same location. Sometimes there is more than one option, e.g. for N2 there are two candidates, 89517 and 89518; which one is picked doesn't matter much. But I don't want to change the rows where both 'zip' and 'location' are NaN. How can I do that?
Since you don't care which value to use, we can use the max value:
>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max())).astype(int)
>>> df
id zip location
0 X2 65123 Houston
1 T5 65123 Houston
2 A1 65123 Houston
3 M8 89517 Berkley
4 X3 89518 Berkley
5 N2 89518 Berkley
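A note on the astype(int) cast above: it only works because every location in this frame has at least one known zip, so no NaN is left after the fill. If some NaN values remain, a plain int cast raises. The sketch below uses a small made-up Series (the name zips and its values are just for illustration) to show pandas' nullable "Int64" dtype as one possible way to keep integer zips alongside missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical zip column that still contains a missing value after filling.
zips = pd.Series([65123.0, np.nan, 89518.0], name="zip")

# A plain int cast cannot represent NaN, so pandas raises a ValueError subclass.
cast_failed = False
try:
    zips.astype(int)
except ValueError:
    cast_failed = True

# The nullable integer dtype (capital "I") stores the gap as <NA> instead.
nullable = zips.astype("Int64")
print(nullable)
```

Whether "Int64" is appropriate depends on what consumes the column afterwards; some downstream code expects plain NumPy dtypes.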
If you need to handle cases where zip and location are both NaN, first filter out the subgroup:
>>> sub_df = df.loc[df[['zip', 'location']].notna().any(axis=1)]
>>> df
id zip location
0 X2 65123.0 Houston
1 T5 65123.0 Houston
2 A1 NaN Houston
3 M7 NaN NaN # <-- added a line in between to show index is maintained
4 M8 89517.0 Berkley
5 X3 89518.0 Berkley
6 N2 NaN Berkley
7 M9 NaN NaN
>>> sub_df
id zip location
0 X2 65123.0 Houston
1 T5 65123.0 Houston
2 A1 NaN Houston # <-- No index 3
4 M8 89517.0 Berkley
5 X3 89518.0 Berkley
6 N2 NaN Berkley
Then perform the same operation (only this time you don't need to cast as int since you will have NaNs in your frame anyway):
>>> df['zip'] = sub_df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))
Result:
id zip location
0 X2 65123.0 Houston
1 T5 65123.0 Houston
2 A1 65123.0 Houston
3 M7 NaN NaN
4 M8 89517.0 Berkley
5 X3 89518.0 Berkley
6 N2 89518.0 Berkley
7 M9 NaN NaN
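One edge case the frame above does not contain: a row that has a zip but no location. As far as I can tell, assigning the transform result back would blank that zip, because the row belongs to no group. A minimal sketch of a variant that only touches missing zips is shown here, assuming a reasonably recent pandas (1.5 or later); the extra row Q4 with zip 10001 is made up for illustration and is not part of the question's data.

```python
import numpy as np
import pandas as pd

# The question's frame plus one hypothetical row (Q4): zip known, location missing.
df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M8", "X3", "N2", "M9", "Q4"],
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan, 10001],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", np.nan, np.nan],
})

# Per-row maximum zip of the row's location; NaN where the location is missing.
group_max = df.groupby("location")["zip"].transform("max")

# Fill only the gaps, so zips that are already present are never overwritten.
df["zip"] = df["zip"].fillna(group_max)
print(df)
```

Because fillna only replaces missing entries, no sub-frame is needed here: M9 stays NaN and Q4 keeps its zip.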
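If you don't care which value is filled in, one simple method is to sort the table by location and zip, then forward-fill the missing values with ffill (spelled fillna(method='ffill') in older pandas; that form is deprecated since pandas 2.1). This assumes every location has at least one known zip; otherwise the fill carries over a value from the previous location.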
>>> df
zip location
0 65123.0 Houston
1 65123.0 Houston
2 NaN Houston
3 89517.0 Berkley
4 89518.0 Berkley
5 NaN Berkley
>>> df.sort_values(by=['location','zip']).ffill()
zip location
3 89517.0 Berkley
4 89518.0 Berkley
5 89518.0 Berkley
0 65123.0 Houston
1 65123.0 Houston
2 65123.0 Houston
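A plain forward fill does not know where one location ends and the next begins. The sketch below adds a made-up location, "Dallas", that has no known zip at all (not part of the original data) and compares the plain fill with a forward fill restricted to each location via groupby.

```python
import numpy as np
import pandas as pd

# The answer's frame plus a hypothetical location ("Dallas") with no known zip.
df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", "Dallas"],
})

# Sorting puts NaN zips last within each location (the default na_position).
ordered = df.sort_values(by=["location", "zip"])

# Plain forward fill: the Dallas row inherits the last Berkley zip.
plain = ordered["zip"].ffill()

# Group-wise forward fill: values never cross a location boundary.
grouped = ordered.groupby("location")["zip"].ffill()

print(plain.loc[6], grouped.loc[6])
```

The grouped result keeps the original index labels, so it can be assigned back with df['zip'] = grouped if that is what you want.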
Update: The solution below also handles NaN in location. Group by location first, then fill the missing zips with the max within each group. Rows whose location is NaN belong to no group, so their zip stays NaN.
>>> df
zip location
0 65123.0 Houston
1 65123.0 Houston
2 NaN Houston
3 89517.0 Berkley
4 89518.0 Berkley
5 NaN Berkley
6 NaN NaN
>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))
>>> df
zip location
0 65123.0 Houston
1 65123.0 Houston
2 65123.0 Houston
3 89517.0 Berkley
4 89518.0 Berkley
5 89518.0 Berkley
6 NaN NaN
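The reason row 6 is left alone is that groupby drops NaN keys by default. A quick way to check this on the frame above is sketched here, assuming pandas 1.1 or later for the dropna parameter; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# Default behaviour: rows with a NaN location are not part of any group.
per_location = df.groupby("location")["zip"].max()

# With dropna=False the NaN location becomes a group of its own.
with_nan_group = df.groupby("location", dropna=False)["zip"].max()

print(per_location)
print(with_nan_group)
```

Even with dropna=False the NaN group has no known zip here, so its max is NaN and there would be nothing to fill with anyway.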