Pandas: groupby forward fill with datetime index

I have a dataset with two columns: company and value.
It has a datetime index that contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill each missing value with the previous datapoint from the same company.

However, I can't seem to find a good way to do this without running into odd groupby errors, which suggests that I'm doing something wrong.

Toy data:

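import pandas as pd

# Wide frame: one column per company, one row per year.
a = pd.DataFrame({'a': [1, 2, None], 'b': [12, None, 14]})
a.index = pd.DatetimeIndex(['2010', '2011', '2012'])
# Reshape to long format: one row per (company, date) pair.
a = a.unstack()
a = a.reset_index().set_index('level_1')
a.columns = ['company', 'value']
a.sort_index(inplace=True)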

Attempted solutions (didn't work: ValueError: cannot reindex from a duplicate axis):

a.groupby('company').ffill() 
a.groupby('company')['value'].ffill() 
a.groupby('company').fillna(method='ffill')

Hacky solution (it delivers the desired result, but is obviously just an ugly workaround):

a['value'] = a.reset_index().groupby(
    'company').fillna(method='ffill')['value'].values

There is probably a simple and elegant way to do this. How is this done in pandas?

One way is to use the transform function to fill the value column after the group by:

import pandas as pd
a['value'] = a.groupby('company')['value'].transform(lambda v: v.ffill())

a
#          company  value
#level_1        
#2010-01-01      a    1.0
#2010-01-01      b   12.0
#2011-01-01      a    2.0
#2011-01-01      b   12.0
#2012-01-01      a    2.0
#2012-01-01      b   14.0

To compare, the original data frame looks like:

#            company    value
#level_1        
#2010-01-01        a      1.0
#2010-01-01        b     12.0
#2011-01-01        a      2.0
#2011-01-01        b      NaN
#2012-01-01        a      NaN
#2012-01-01        b     14.0
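
For readers who want to try the transform approach in isolation, here is a minimal self-contained sketch. It rebuilds the toy frame directly in long format instead of using the unstack steps from the question; the names toy and filled are illustrative only and do not appear in the original posts.

```python
import numpy as np
import pandas as pd

# Rebuild the toy data from the question: duplicated dates, one row per company.
dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy = pd.DataFrame(
    {'company': ['a', 'b', 'a', 'b', 'a', 'b'],
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=dates,
)

# transform returns a result aligned row by row with the input,
# so it can be assigned straight back even though the index has duplicates.
filled = toy.copy()
filled['value'] = toy.groupby('company')['value'].transform(lambda v: v.ffill())

print(filled)
```

Each company's gap is filled from that company's own previous row, so the 2011 value for b comes from 2010 (12.0) rather than from a's 2011 value.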

You can add 'company' to the index, making it unique, and do a simple ffill via groupby:

a = a.set_index('company', append=True)
a = a.groupby(level=1).ffill()

I'd recommend keeping 'company' as part of the index (or just adding it to the index to begin with), so your index remains unique. If necessary, you can use reset_index to revert the index back to just the date:

a = a.reset_index(level=1)
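
The index-based approach can be sketched end to end as follows. This is only an illustration on a rebuilt copy of the toy data: the variable names and the 'date' index name are assumptions, and the level is selected by name rather than by position.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy2 = pd.DataFrame(
    {'company': ['a', 'b', 'a', 'b', 'a', 'b'],
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=pd.DatetimeIndex(dates, name='date'),
)
print(toy2.index.is_unique)     # False: each date appears once per company

# Appending 'company' gives a (date, company) MultiIndex with no duplicates.
indexed = toy2.set_index('company', append=True)
print(indexed.index.is_unique)  # True

# Forward fill within each company, grouping on the index level.
filled2 = indexed.groupby(level='company').ffill()

# Optional: move 'company' back to a regular column.
restored = filled2.reset_index(level='company')
print(restored)
```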

I like to use stacking and unstacking. In this case, it requires that I append 'company' to the index.

a.set_index('company', append=True).unstack().ffill() \
                                   .stack().reset_index('company')
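
To make the intermediate shape visible, the chain above can be split into steps. This is a sketch on a rebuilt copy of the toy data, with illustrative variable names that are not from the original answer.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01', '2011-01-01',
                        '2011-01-01', '2012-01-01', '2012-01-01'])
toy3 = pd.DataFrame(
    {'company': ['a', 'b', 'a', 'b', 'a', 'b'],
     'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
    index=pd.DatetimeIndex(dates, name='date'),
)

# Wide form: one row per date, one ('value', company) column per company.
wide = toy3.set_index('company', append=True).unstack('company')
print(wide)

# A plain ffill now works column by column, that is, per company.
wide_filled = wide.ffill()

# Back to long form, with 'company' as a regular column again.
long_again = wide_filled.stack().reset_index('company')
print(long_again)
```

One thing to watch: in wide form every company gets a row for every date that occurs anywhere in the data, so the result can contain (date, company) pairs that were absent from the original long frame. On the toy data this makes no difference, because every company has a row on every date.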


Timing

Conclusion: @Psidom's solution works best under both scenarios.

toy data

[Image: timing results for the toy data]

bigger toy

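import numpy as np
import pandas as pd

np.random.seed([3,1415])
n = 10000
# Wide frame: 10,000 hourly rows, one column per company.
a = pd.DataFrame(np.random.randn(n, 10),
                 pd.date_range('2014-01-01', periods=n, freq='H', name='Time'),
                 pd.Index(list('abcdefghij'), name='company'))

# Randomly replace about 40% of the values with NaN.
a *= np.random.choice((1, np.nan), (n, 10), p=(.6, .4))

# Long format: 'Time' index with duplicates, 'company' and 'value' columns.
a = a.stack(dropna=False).rename('value').reset_index('company')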

[Image: timing results for the bigger data set]
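
Since the timing screenshots are not reproduced here, the comparison can be re-run locally with something like the sketch below. It is only a rough harness under stated assumptions: the data set is a smaller, hypothetical stand-in built with melt (n, the daily frequency, and all function names are illustrative), and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical, smaller stand-in for the "bigger toy" data (n is an assumption).
rng = np.random.default_rng(0)
n = 500
companies = list('abcdefghij')
wide_data = pd.DataFrame(
    rng.standard_normal((n, len(companies))),
    index=pd.date_range('2014-01-01', periods=n, freq='D', name='Time'),
    columns=pd.Index(companies, name='company'),
)
# Blank out roughly 40% of the cells.
wide_data = wide_data.mask(rng.random(wide_data.shape) < 0.4)

# Long format: 'Time' index with duplicates, 'company' and 'value' columns.
long_data = (wide_data.reset_index()
                      .melt(id_vars='Time', var_name='company', value_name='value')
                      .set_index('Time')
                      .sort_index(kind='stable'))


def with_transform(df):
    # groupby + transform on the value column
    return df.groupby('company')['value'].transform(lambda v: v.ffill())


def with_index_level(df):
    # 'company' appended to the index, then groupby on that level
    return df.set_index('company', append=True).groupby(level='company').ffill()


def with_unstack(df):
    # unstack to wide form, ffill, stack back to long form
    return (df.set_index('company', append=True).unstack('company').ffill()
              .stack().reset_index('company'))


approaches = {
    'transform': with_transform,
    'index level': with_index_level,
    'unstack/stack': with_unstack,
}
# Total seconds for 3 runs of each approach.
timings = {name: timeit.timeit(lambda f=f: f(long_data), number=3)
           for name, f in approaches.items()}
for name, seconds in timings.items():
    print(f'{name:>14}: {seconds:.4f} s for 3 runs')
```

The first two approaches return results in the original row order, so they can be compared directly; the stack-based one may return a different number of rows, depending on how the installed pandas version handles NaN in stack.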
