简体   繁体   中英

Backfill values in Pandas series when value matches another column

I have a DataFrame like this:

import numpy as np

raw_data = {'surface': [np.nan, np.nan, 'round', 'square'],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami']}

df = pd.DataFrame(raw_data, columns = ['surface', 'city'])

This looks like this:

        surface city
   0    NaN     San Francisco
   1    NaN     Miami
   2    round   San Francisco
   3    square  Miami

I need earliest instance of the San Francisco row to be filled with 'round', and the earlier Miami row to be filled with 'square'. Using .fillna(method='bfill') won't take into account other column values, and just fills all earlier rows with round.

The result would be:

        surface city
   0    round   San Francisco
   1    square  Miami
   2    round   San Francisco
   3    square  Miami

You can use groupby.bfill ; group data frame by city column and then use bfill :

df.groupby('city').bfill()

#  surface           city
#0  round   San Francisco
#1  square          Miami
#2  round   San Francisco
#3  square          Miami

[Modified based on the admirable answer from PSidom ]

Using groupby() is the key point indeed, but it might be confusing not to mention what bfill() does as it's not doing what you actually think it does.

Let's take a quick glance at the doc here . Instead of back filling the data like what the OP wants, it actually just fill in the missing data with non-missing data in the next column. It works great with groupby() in this case, while you also need to do groupby('*your group*').ffill() for forward filling in case that the data you have are more complicated.

For further illustration, let's modify your data like this:

import numpy as np
import pandas as pd

raw_data = {'surface': [np.nan, np.nan, 'round', 'square', np.nan, np.nan, np.nan, np.nan],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami', 'Miami', 'Miami', 'San Francisco', 'Miami']}
df = pd.DataFrame(raw_data, columns = ['surface', 'city'])
df

#   surface city
#0  NaN     San Francisco
#1  NaN     Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

With only df.groupby('city').bfill() , you'll got:

df2 = df.groupby('city').bfill()
df2

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

See what is going on there? bfill() did the job in row 0 and 1, but remain row 4 ~ 7 unchanged. You should use both bfill() and ffill() instead. Maybe something like this:

df3 =  df2.groupby('city').ffill()
df3

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  square  Miami
#5  square  Miami
#6  round   San Francisco
#7  square  Miami

To be noticed, you shouldn't use something like df.groupby('city').bfill().ffill() . It'll fill in something wrong there.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM