
Replace NaN values with other values from same column

My DataFrame looks like this:

id    zip     location
X2    65123   Houston
T5    65123   Houston
A1    nan     Houston
M8    89517   Berkley
X3    89518   Berkley
N2    nan     Berkley
M9    nan     nan

For some rows I don't have the zip code, but I do have an entry in 'location'.
I'd like to fill the NaN values in 'zip' with one of the zip codes from the same location. Sometimes there is more than one option, e.g. for N2 there are two options, 89517 and 89518; which one gets picked doesn't matter much. But I don't want to change the rows where both zip and location are NaN. How can I do that?

Since it doesn't matter which value is used, fill each NaN with the maximum zip of its location. Note that the cast to int only works if no NaN remains afterwards:

>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max())).astype(int)
>>> df

   id    zip location
0  X2  65123  Houston
1  T5  65123  Houston
2  A1  65123  Houston
3  M8  89517  Berkley
4  X3  89518  Berkley
5  N2  89518  Berkley
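
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame is rebuilt by hand from the question's sample data (without the M9 row, since the int cast would fail on it), so treat the construction as an assumption about how the original data was loaded.

```python
import numpy as np
import pandas as pd

# Sample data from the question, minus the row where zip and location are both missing.
df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M8", "X3", "N2"],
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan],
    "location": ["Houston", "Houston", "Houston", "Berkley", "Berkley", "Berkley"],
})

# Within each location, replace missing zips with that location's largest known zip.
# transform() returns a Series aligned to the original index, so it can be assigned back.
df["zip"] = (
    df.groupby("location")["zip"]
      .transform(lambda x: x.fillna(x.max()))
      .astype(int)  # safe here only because every location has at least one known zip
)
print(df)
```

If a location had no known zip at all, its rows would stay NaN and the `astype(int)` call would raise.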

If you need to handle rows where zip and location are both NaN, first filter them out into a sub-frame:

>>> sub_df = df.loc[df[['zip', 'location']].notna().any(axis=1)]
>>> df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston
3  M7      NaN      NaN    # <-- added a line in between to show index is maintained
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley
7  M9      NaN      NaN

>>> sub_df
   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1      NaN  Houston    # <-- No index 3
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2      NaN  Berkley

Then perform the same operation on the sub-frame (this time there is no need to cast to int, since the frame will still contain NaNs anyway). The assignment aligns on the index, so rows missing from sub_df end up as NaN:

>>> df['zip'] = sub_df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))

Result:

   id      zip location
0  X2  65123.0  Houston
1  T5  65123.0  Houston
2  A1  65123.0  Houston
3  M7      NaN      NaN
4  M8  89517.0  Berkley
5  X3  89518.0  Berkley
6  N2  89518.0  Berkley
7  M9      NaN      NaN
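
Put together, the filter-then-fill steps above might look like the sketch below. The frame is typed in by hand from the printed output, and the variable name `filled` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["X2", "T5", "A1", "M7", "M8", "X3", "N2", "M9"],
    "zip": [65123, 65123, np.nan, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston", np.nan,
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# Keep rows where at least one of zip/location is present (drops M7 and M9).
sub_df = df.loc[df[["zip", "location"]].notna().any(axis=1)]

# Fill within each location; the result carries index labels 0, 1, 2, 4, 5, 6.
filled = sub_df.groupby("location")["zip"].transform(lambda x: x.fillna(x.max()))

# Assignment aligns on the index: labels 3 and 7 are absent from `filled`,
# so those rows receive NaN, which is what they held before.
df["zip"] = filled
print(df)
```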

If you don't care which value is filled in, one simple method is to sort the table by location and zip, then forward-fill. sort_values places NaN last within each location, so each missing zip picks up the last known zip above it:

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley

>>> df.sort_values(by=['location', 'zip']).ffill()
       zip location
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
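
One caveat with a plain forward fill: if a location has no known zip at all, it inherits the last zip of whatever location sorts before it. The sketch below illustrates this with a hypothetical "Dallas" row (not part of the question's data) and shows a per-group forward fill as one possible way around it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Dallas" has no known zip at all.
df = pd.DataFrame({
    "zip": [65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Berkley", "Berkley", "Berkley", "Dallas"],
})

# Sort so that, within each location, known zips come before NaN.
ordered = df.sort_values(by=["location", "zip"])

# Plain forward fill: the Dallas row picks up Berkley's last zip.
plain = ordered["zip"].ffill()

# Forward fill restricted to each location: the Dallas row stays NaN.
grouped = ordered.groupby("location")["zip"].ffill()

print(pd.DataFrame({"location": ordered["location"], "plain": plain, "grouped": grouped}))
```

Both results keep the original index labels, so either can be assigned back to `df['zip']` without undoing the sort by hand.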

Update: the solution below also handles NaN in location. Group by location, then fill the NaNs in each group with that group's maximum; rows with a NaN location belong to no group and stay NaN.

>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2      NaN  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5      NaN  Berkley
6      NaN      NaN

>>> df['zip'] = df.groupby('location')['zip'].transform(lambda x: x.fillna(x.max()))
>>> df
       zip location
0  65123.0  Houston
1  65123.0  Houston
2  65123.0  Houston
3  89517.0  Berkley
4  89518.0  Berkley
5  89518.0  Berkley
6      NaN      NaN
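
The zips print as floats (65123.0) because a NumPy integer column cannot hold NaN. If integer output matters, one option is pandas' nullable `Int64` dtype. The following is a sketch of that idea, not part of the original answer; it computes the per-location maximum with `transform('max')` and fills from it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip": [65123, 65123, np.nan, 89517, 89518, np.nan, np.nan],
    "location": ["Houston", "Houston", "Houston",
                 "Berkley", "Berkley", "Berkley", np.nan],
})

# Per-location maximum, aligned to the rows of df. Rows with a NaN location
# belong to no group, so they contribute no fill value.
group_max = df.groupby("location")["zip"].transform("max")

# Fill only the missing zips, then switch to the nullable integer dtype,
# which can represent missing values as <NA>.
df["zip"] = df["zip"].fillna(group_max).astype("Int64")
print(df)
```

Because `fillna` only touches missing entries, a row that has a zip but no location keeps its zip with this variant.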


 