Pandas: groupby forward fill with a datetime index

Question

I have a dataset that has two columns: company and value.
It has a datetime index, which contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill the missing data with the previous data point from the same company.

However, I can't seem to find a good way to do this without running into odd groupby errors, suggesting that I'm doing something wrong.

Toy data:

import pandas as pd

a = pd.DataFrame({'a': [1, 2, None], 'b': [12, None, 14]})
a.index = pd.DatetimeIndex(['2010', '2011', '2012'])
a = a.unstack()  # long format: one row per (company, date)
a = a.reset_index().set_index('level_1')  # keep the date as the index
a.columns = ['company', 'value']
a.sort_index(inplace=True)

Attempted solutions (none worked; each raised ValueError: cannot reindex from a duplicate axis):

a.groupby('company').ffill() 
a.groupby('company')['value'].ffill() 
a.groupby('company').fillna(method='ffill')

Hacky solution (it delivers the desired result, but is obviously just an ugly workaround):

a['value'] = a.reset_index().groupby(
    'company').fillna(method='ffill')['value'].values

There is probably a simple and elegant way to do this. How is this done in Pandas?

Answer 1

One way is to use the transform function to fill the value column after grouping by company:

import pandas as pd
a['value'] = a.groupby('company')['value'].transform(lambda v: v.ffill())

a
#          company  value
#level_1        
#2010-01-01      a    1.0
#2010-01-01      b   12.0
#2011-01-01      a    2.0
#2011-01-01      b   12.0
#2012-01-01      a    2.0
#2012-01-01      b   14.0

For comparison, the original data frame looks like this:

#            company    value
#level_1        
#2010-01-01        a      1.0
#2010-01-01        b     12.0
#2011-01-01        a      2.0
#2011-01-01        b      NaN
#2012-01-01        a      NaN
#2012-01-01        b     14.0
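As a self-contained check of the transform approach, here is a minimal sketch. It rebuilds the toy data directly in long form (an assumption made for brevity, rather than going through unstack as in the question); the name `filled` is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy data in long form: each date appears once per company,
# so the datetime index contains duplicates.
dates = pd.to_datetime(['2010-01-01', '2010-01-01',
                        '2011-01-01', '2011-01-01',
                        '2012-01-01', '2012-01-01'])
a = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                  'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
                 index=dates)

# Forward fill within each company. transform returns a Series with the
# same length and row order as the original frame.
filled = a.groupby('company')['value'].transform(lambda v: v.ffill())

print(filled.tolist())
# [1.0, 12.0, 2.0, 12.0, 2.0, 14.0]
```

Each gap is filled from the same company's previous row: the missing 2011 value for b becomes 12.0 and the missing 2012 value for a becomes 2.0.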

Answer 2

You can add 'company' to the index, making it unique, and do a simple ffill via groupby:

a = a.set_index('company', append=True)
a = a.groupby(level=1).ffill()

From here, you can use reset_index to revert the index back to just the date, if necessary. I'd recommend keeping 'company' as part of the index (or just adding it to the index to begin with), so your index remains unique:

a = a.reset_index(level=1)
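The round trip through a MultiIndex can be sketched end to end as follows. The toy frame is rebuilt in long form for the sake of a runnable example, and the intermediate names (`indexed`, `filled`, `result`) are illustrative. Grouping by the level name 'company' is assumed to be equivalent to level=1 here, since 'company' is the second index level.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01',
                        '2011-01-01', '2011-01-01',
                        '2012-01-01', '2012-01-01'])
a = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                  'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
                 index=dates)

# Appending 'company' turns the duplicated date index into a unique
# (date, company) MultiIndex.
indexed = a.set_index('company', append=True)
print(indexed.index.is_unique)  # True

# Forward fill within each company, grouping on the index level.
filled = indexed.groupby(level='company').ffill()

# Optionally move 'company' back into a regular column.
result = filled.reset_index(level='company')
print(result['value'].tolist())
# [1.0, 12.0, 2.0, 12.0, 2.0, 14.0]
```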

Answer 3

I like to use stacking and unstacking. In this case, it requires that I append 'company' to the index.

a.set_index('company', append=True).unstack().ffill() \
                                   .stack().reset_index('company')
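To make the intermediate shape visible, the same chain can be split into steps. This is a sketch on a rebuilt copy of the toy data; the names `wide` and `long` are illustrative. One caveat worth keeping in mind: depending on the pandas version, stack() may drop rows that are still NaN after filling (for example, a company whose first observation is missing), so the result can have fewer rows than the input. The toy data has no such rows.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2010-01-01', '2010-01-01',
                        '2011-01-01', '2011-01-01',
                        '2012-01-01', '2012-01-01'])
a = pd.DataFrame({'company': ['a', 'b', 'a', 'b', 'a', 'b'],
                  'value': [1.0, 12.0, 2.0, np.nan, np.nan, 14.0]},
                 index=dates)

# Wide layout: one row per date, one column per company.
wide = a.set_index('company', append=True).unstack()
print(wide.shape)  # (3, 2)

# In the wide layout a plain column-wise ffill is already per company.
wide = wide.ffill()

# Back to the long layout, with 'company' as a regular column again.
long = wide.stack().reset_index('company')
print(long['value'].tolist())
# [1.0, 12.0, 2.0, 12.0, 2.0, 14.0]
```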

Timing

Conclusion: @Psidom's solution works best under both scenarios.

toy data

bigger toy

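import numpy as np
import pandas as pd

np.random.seed([3,1415])
n = 10000
# wide frame: one row per hourly timestamp, one column per company
a = pd.DataFrame(np.random.randn(n, 10),
                 pd.date_range('2014-01-01', periods=n, freq='H', name='Time'),
                 pd.Index(list('abcdefghij'), name='company'))

# knock out roughly 40% of the values
a *= np.random.choice((1, np.nan), (n, 10), p=(.6, .4))

# back to long format, keeping the NaN rows
a = a.stack(dropna=False).rename('value').reset_index('company')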
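The timing figures themselves are not reproduced here, so the following is only a rough sketch of how the three approaches could be compared on a similar frame. Everything in it is an assumption for illustration: the function names, the smaller size (n = 1000), the daily frequency, and the way the long frame is built directly instead of via stack. Results will vary with machine and pandas version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, companies = 1000, list('abcdefghij')

# Long-form frame: every timestamp appears once per company,
# with roughly 40% of the values missing.
times = pd.date_range('2014-01-01', periods=n, freq='D', name='Time')
values = rng.standard_normal(n * len(companies))
values[rng.random(values.size) < 0.4] = np.nan
a = pd.DataFrame({'company': np.tile(companies, n), 'value': values},
                 index=times.repeat(len(companies)))


def via_transform(df):
    # Answer 1: groupby + transform
    return df.groupby('company')['value'].transform(lambda v: v.ffill())


def via_multiindex(df):
    # Answer 2: unique (Time, company) index, then groupby on the level
    out = df.set_index('company', append=True).groupby(level='company').ffill()
    return out.reset_index(level='company')['value']


def via_unstack(df):
    # Answer 3: unstack to wide, ffill, stack back to long
    wide = df.set_index('company', append=True).unstack().ffill()
    return wide.stack().reset_index('company')['value']


for fn in (via_transform, via_multiindex, via_unstack):
    seconds = timeit.timeit(lambda: fn(a), number=5)
    print(f'{fn.__name__}: {seconds / 5:.4f} s per call')
```

Note that via_unstack may return fewer rows than the other two on some pandas versions, because leading NaNs that cannot be filled can be dropped by stack().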

3 solutions

Solution 1
14 2016-07-26 18:28:21

Solution 2
9 2016-07-26 18:35:57

Solution 3
5 2016-07-26 18:49:10