简体   繁体   中英

Using pandas to fill gaps only, and not NaNs on the ends

I have some housing price data that spans about 8 months, and tracks the price as houses come onto the market up until they are sold. There are a couple gaps in the data in the middle that I'd like to fill in, but I'd like to leave the NaNs on the end of each untouched.

To use a simple example, let's say we have house1 that comes on the market for 200000 on 'Day 4', and sells for 190000 on 'Day 9'. And we have house2 that stays at 180000 for Days 1 - 12 and doesn't sell in that time window. But, something went wrong on days 6 and 7 and I lost the data:

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

Now imagine instead of regular arrays these were columns in Pandas Dataframes indexed by date.

The trouble is, the function I would normally use to fill the gaps here would be DataFrame.fillna() using either the backfill or ffill methods. If I use ffill, house1 returns this:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

Which fills the gap, but also incorrectly fills the data past the day of sale. If I use backfill instead, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front end of the data. If I use 'limit=2' with ffill, then what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once again, it fills the gap, but then it also begins to fill the data beyond the end of where the 'real' data ends.

My solution so far was to write the following function:

def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        filled_col = column
        lastValue = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        i = 0
        for currentValue in filled_col:
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap
            elif isANumber(currentValue):
                gapIndicies = range(lastIndex + 1, i)
                for j in gapIndicies:
                    filled_col[j] = lastValue
                gap = False
            i += 1
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

It simply skips all the NaNs in front, fills in the gaps (defined by groups of NaNs between real values), and doesn't fill in NaNs on the end.

Is there a cleaner way to do this, or a built-in pandas function I'm unaware of?

I found this answer a year later but needed it to work on a DataFrame with multiple columns, so I wanted to leave my solution here in case someone else needs the same. My function is just an modified version of YS-L's

def fillna_downbet(df):
    df = df.copy()
    for col in df:
        non_nans = df[col][~df[col].apply(np.isnan)]
        start, end = non_nans.index[0], non_nans.index[-1]
        df[col].loc[start:end] = df[col].loc[start:end].fillna(method='ffill')
    return df

Thanks!

Yet another solution for a DataFrame with multiple columns

df.fillna(method='ffill') + (df.fillna(method='bfill') * 0)

How does it work?

The first fillna does a forward fill of values. This is almost what we want, except it leaves a trail of filled values at the end of each series.

The second fillna does a backward fill of values which we multiply by zero. The result is that our unwanted trailing values will be NaN, and everything else will be 0.

Finally, we add the two together, taking advantage of the fact that x + 0 = x and x + NaN = NaN.

You can use fillna on certain parts of the Series. Based on your description, fillna should only fill up the NaNs after the first non-NaN, and before the last non-NaN:

import numpy as np
import pandas as pd


def fill_column(house):
    house = house.copy()
    non_nans = house[~house.apply(np.isnan)]
    start, end = non_nans.index[0], non_nans.index[-1]
    house.ix[start:end] = house.ix[start:end].fillna(method='ffill')
    return house


house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print fill_column(house1)

Output:

0        NaN
1        NaN
2        NaN
3     200000
4     200000
5     200000
6     200000
7     200000
8     190000
9        NaN
10       NaN
11       NaN

Note that this assumes that the Series contains at least two non-NaNs, corresponding to the prices on the first and last day.

Here is a function that works with modern pandas (>=1.1), with multiple gaps, with no gaps at all and-most importantly-with .groupby() as well:

def fill_gap(s, method="ffill"):
    """Fills true gap in series."""
    col = s.copy()
    first_idx = col.first_valid_index()
    last_idx = col.last_valid_index()
    col.loc[first_idx:last_idx] = col.loc[first_idx:last_idx].fillna(method=method)
    return col

Make sure the index is strictly ascending!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM