Using pandas to fill gaps only, and not NaNs on the ends

I have some housing price data that spans about 8 months and tracks the price of each house from the time it comes onto the market until it is sold. There are a couple of gaps in the middle of the data that I'd like to fill in, but I'd like to leave the NaNs at either end of each series untouched.

To use a simple example, let's say we have house1, which comes on the market for 200000 on 'Day 4' and sells for 190000 on 'Day 9'. And we have house2, which stays at 180000 for Days 1 - 12 and doesn't sell in that time window. But something went wrong on days 6 and 7 and I lost the data:

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

Now imagine that instead of regular arrays these were columns in a pandas DataFrame indexed by date.

The trouble is, the function I would normally use to fill the gaps here is DataFrame.fillna() with either the backfill or ffill method. If I use ffill, house1 returns this:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

This fills the gap, but also incorrectly fills the data past the day of sale. If I use backfill instead, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front end of the data. If I use 'limit=2' with ffill, then what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once again, it fills the gap, but then it also begins to fill the data beyond the point where the 'real' data ends.
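
The three attempts above can be reproduced with a short sketch. It assumes house1 is held in a pandas Series with a default integer index (the question uses a date index), and it calls the ffill() and bfill() methods directly, which fill in the same direction as the corresponding fillna methods.

```python
import numpy as np
import pandas as pd

nan = np.nan
house1 = pd.Series([nan, nan, nan, 200000, 200000, nan, nan,
                    200000, 190000, nan, nan, nan])

forward = house1.ffill()          # fills the gap, but also the tail after the sale
backward = house1.bfill()         # fills the gap, but also the days before listing
limited = house1.ffill(limit=2)   # still fills two of the three trailing NaNs

# Count the NaNs each attempt leaves behind. The desired result would
# leave 6 (three leading and three trailing).
print(forward.isna().sum(), backward.isna().sum(), limited.isna().sum())
```

None of the three leaves both ends untouched, which is the behaviour the question is asking for.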

My solution so far was to write the following function:

import pandas as pd


def isANumber(value):
    """Helper not shown in the original post; assumed to be a NaN check."""
    return pd.notna(value)


def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        # Work on a copy so the caller's column is not modified in place
        filled_col = column.copy()
        lastValue = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        for i, currentValue in enumerate(column):
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap (positional slice, so a date index works too)
            elif isANumber(currentValue):
                filled_col.iloc[lastIndex + 1:i] = lastValue
                # The value that closes the gap is the reference for the next gap
                lastIndex = i
                lastValue = currentValue
                gap = False
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

It simply skips all the NaNs in front, fills in the gaps (defined as groups of NaNs between real values), and doesn't fill in the NaNs on the end.

Is there a cleaner way to do this, or a built-in pandas function I'm unaware of?

I found this answer a year later but needed it to work on a DataFrame with multiple columns, so I wanted to leave my solution here in case someone else needs the same. My function is just a modified version of YS-L's.

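def fillna_downbet(df):
    df = df.copy()
    for col in df:
        non_nans = df[col].dropna()
        if non_nans.empty:
            continue  # column has no real values, so there is nothing to fill
        start, end = non_nans.index[0], non_nans.index[-1]
        # Assign through df.loc rather than df[col].loc to avoid chained assignment
        df.loc[start:end, col] = df.loc[start:end, col].ffill()
    return df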

Thanks!谢谢!

Yet another solution for a DataFrame with multiple columns:

df.ffill() + (df.bfill() * 0)

How does it work?

The first term does a forward fill of values. This is almost what we want, except it leaves a trail of filled values at the end of each series.

The second term does a backward fill of values, which we then multiply by zero. The result is that the unwanted trailing positions are NaN, and everything else is 0.

Finally, we add the two together, taking advantage of the fact that x + 0 = x and x + NaN = NaN. The leading NaNs survive too, because the forward fill leaves them as NaN and NaN + 0 = NaN.
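
As a rough illustration of those three steps, here is a sketch on the two houses from the question. The DataFrame layout and the names forward, mask and result are only chosen for this example, and a default integer index stands in for the dates.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "house1": [nan, nan, nan, 200000, 200000, nan, nan,
               200000, 190000, nan, nan, nan],
    "house2": [180000] * 5 + [nan, nan] + [180000] * 5,
}, dtype="float64")

forward = df.ffill()      # gap and trailing NaNs filled, leading NaNs kept
mask = df.bfill() * 0     # 0 up to the last real value, NaN after it
result = forward + mask   # x + 0 = x, x + NaN = NaN

print(result)
```

For house1 only days 6 and 7 end up filled. For house2 the gap is filled and nothing else changes, since it has no NaNs at either end.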

You can restrict the fill to one part of the Series. Based on your description, only the NaNs after the first non-NaN and before the last non-NaN should be filled:

import numpy as np
import pandas as pd


def fill_column(house):
    house = house.copy()
    non_nans = house[~house.apply(np.isnan)]
    start, end = non_nans.index[0], non_nans.index[-1]
    # .ix was removed in pandas 1.0; .loc does the same label-based slice
    house.loc[start:end] = house.loc[start:end].ffill()
    return house


house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print(fill_column(house1))

Output:

0          NaN
1          NaN
2          NaN
3     200000.0
4     200000.0
5     200000.0
6     200000.0
7     200000.0
8     190000.0
9          NaN
10         NaN
11         NaN
dtype: float64

Note that this assumes the Series contains at least one non-NaN value (in this data, normally two or more: the prices on the first and last day). On an all-NaN Series, non_nans.index[0] raises an IndexError.

Here is a function that works with modern pandas (>=1.1), with multiple gaps, with no gaps at all and, most importantly, with .groupby() as well:

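def fill_gap(s, method="ffill"):
    """Fills true gap in series."""
    col = s.copy()
    first_idx = col.first_valid_index()
    last_idx = col.last_valid_index()
    # Only touch the span between the first and last valid value
    inner = col.loc[first_idx:last_idx]
    # fillna(method=...) is deprecated since pandas 2.1, so call ffill()/bfill() directly
    col.loc[first_idx:last_idx] = inner.ffill() if method == "ffill" else inner.bfill()
    return col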

Make sure the index is strictly ascending!
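
For the .groupby() case, a minimal sketch might look like the following. The long-format prices frame and its house, price and filled columns are hypothetical, made up for this example, and fill_gap is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def fill_gap(s, method="ffill"):
    """Fill NaNs only between the first and last valid value of a Series."""
    col = s.copy()
    first_idx = col.first_valid_index()
    last_idx = col.last_valid_index()
    inner = col.loc[first_idx:last_idx]
    col.loc[first_idx:last_idx] = inner.ffill() if method == "ffill" else inner.bfill()
    return col


nan = np.nan
# Hypothetical long-format data: one row per (house, day)
prices = pd.DataFrame({
    "house": ["house1"] * 6 + ["house2"] * 6,
    "price": [nan, 200000, nan, 190000, nan, nan,
              180000, nan, nan, 180000, 180000, nan],
})

# transform() calls fill_gap once per house and keeps the original row order
prices["filled"] = prices.groupby("house")["price"].transform(fill_gap)
print(prices)
```

On a wide DataFrame with one column per house, df.apply(fill_gap) should give the same per-column behaviour.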
