
Can I resample (ffill) a pandas DataFrame with a MultiIndex?

I have a DataFrame that looks like this, with a MultiIndex over ('timestamp', 'id'):

                 value
timestamp   id
2020-03-03  A    100
2020-03-03  B    222
2020-03-03  C    5000
2020-03-04  A    NaN
2020-03-04  B    1
2020-03-04  C    NaN
2020-03-05  A    200
2020-03-05  B    NaN
2020-03-05  C    NaN
2020-03-06  A    NaN
2020-03-06  B    20
2020-03-06  C    NaN

I want to forward fill value over time, so that every row is populated with the most recently available value for its id, i.e. the DataFrame becomes:

                 value
timestamp   id
2020-03-03  A    100
2020-03-03  B    222
2020-03-03  C    5000
2020-03-04  A    100
2020-03-04  B    1
2020-03-04  C    5000
2020-03-05  A    200
2020-03-05  B    1
2020-03-05  C    5000
2020-03-06  A    200
2020-03-06  B    20
2020-03-06  C    5000

Is there an easy way to do this using a resampler?

You can sort by the second index level (id) so that each id's rows are contiguous, ffill, then reindex back to the original order:

df.sort_index(level=1).ffill().reindex(df.index)

                value
timestamp  id        
2020-03-03 A    100.0
           B    222.0
           C   5000.0
2020-03-04 A    100.0
           B      1.0
           C   5000.0
2020-03-05 A    200.0
           B      1.0
           C   5000.0
2020-03-06 A    200.0
           B     20.0
           C   5000.0
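For anyone who wants to run this, here is a minimal self-contained sketch that rebuilds the question's frame and applies the one-liner above. The variable names (dates, index, values, filled) are my own, and parsing the timestamps with pd.to_datetime is an assumption about the asker's data.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question (dtype of the timestamps is assumed).
dates = pd.to_datetime(["2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06"])
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]], names=["timestamp", "id"])
values = [100, 222, 5000, np.nan, 1, np.nan, 200, np.nan, np.nan, np.nan, 20, np.nan]
df = pd.DataFrame({"value": values}, index=index)

# Sort by id so each id's rows sit together in time order, forward fill,
# then restore the original (timestamp, id) ordering.
filled = df.sort_index(level=1).ffill().reindex(df.index)
print(filled)
```

This works here because every id has a value on the first timestamp, so nothing can be carried over from the previous id's block.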

You can also use unstack to arrange the data in a proper 2D representation (one column per id) for column-wise filling, and then stack it back to the original format. This treats each column (i.e. each id) separately, as opposed to rolling values over from one id into the next, which is what happens in the other solution given.

import numpy as np
import pandas as pd

a = ['2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06']
b = ['A', 'B', 'C']
c = ['value1', 'value2']
# Start from an all-NaN float frame with one row per (timestamp, id) pair
idx = pd.MultiIndex.from_product([a, b], names=['timestamp', 'id'])
df = pd.DataFrame(data=None, index=idx, columns=c, dtype=float)
df.loc[('2020-03-03', slice(None)), 'value1'] = np.array([100, 222, 5000])
df.loc[('2020-03-04', 'B'), 'value1'] = 1.0
df.loc[('2020-03-05', 'A'), 'value1'] = 200.0
df.loc[('2020-03-06', 'C'), 'value1'] = 20
# value2 is a copy of value1, except that 'C' starts with a NaN
df['value2'] = df['value1']
df.loc[('2020-03-03', 'C'), 'value2'] = np.nan
df

                 value1  value2
timestamp   id
2020-03-03  A    100     100
2020-03-03  B    222     222
2020-03-03  C    5000    NaN   # <- note the NaN
2020-03-04  A    NaN     NaN
2020-03-04  B    1       1
2020-03-04  C    NaN     NaN
2020-03-05  A    200     200
2020-03-05  B    NaN     NaN
2020-03-05  C    NaN     NaN
2020-03-06  A    NaN     NaN
2020-03-06  B    NaN     NaN
2020-03-06  C    20      20

Using df.unstack().ffill() (equivalent to df.unstack().fillna(method='ffill'), whose method argument is deprecated in recent pandas releases) gives

            value1             value2
            A     B     C      A     B     C
timestamp
2020-03-03  100   222  5000    100   222   NaN
2020-03-04  100   1    5000    100   1     NaN
2020-03-05  200   1    5000    200   1     NaN
2020-03-06  200   1    20      200   1     20

This can be converted back to the original format with .stack().
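As a sketch of the full round trip, the snippet below rebuilds the two-column example, fills it in wide form, and stacks it back. The names wide and restored are my own. The final reindex is an addition of mine, not part of the answer: it guards against stack() returning rows in a different order or dropping all-NaN rows, which depends on the pandas version.

```python
import numpy as np
import pandas as pd

dates = ["2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06"]
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]], names=["timestamp", "id"])
value1 = [100, 222, 5000, np.nan, 1, np.nan, 200, np.nan, np.nan, np.nan, np.nan, 20]
df = pd.DataFrame({"value1": value1}, index=index)
df["value2"] = df["value1"]
df.loc[("2020-03-03", "C"), "value2"] = np.nan  # 'C' starts with a NaN in value2

# Wide form: one row per timestamp, one column per (value, id) pair.
wide = df.unstack("id").ffill()

# Back to long form, aligned to the original (timestamp, id) index.
restored = wide.stack("id").reindex(df.index)
print(restored)
```

The leading NaNs of 'C' in value2 stay NaN, because there is no earlier value for 'C' to fill from.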

Comparing this to df.sort_index(level=1).ffill().reindex(df.index), the difference is in the last column: since 'C' starts with a NaN in 'value2', the last value of 'B' (which is 1) is rolled into the start of 'C'.
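The difference can be checked directly. The sketch below reproduces the leak, and also shows df.groupby(level="id").ffill(), which is not one of the answers above but, as far as I can tell, is another way to fill each id separately without reshaping. The names leaky and per_id are my own.

```python
import numpy as np
import pandas as pd

dates = ["2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06"]
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]], names=["timestamp", "id"])
value1 = [100, 222, 5000, np.nan, 1, np.nan, 200, np.nan, np.nan, np.nan, np.nan, 20]
df = pd.DataFrame({"value1": value1}, index=index)
df["value2"] = df["value1"]
df.loc[("2020-03-03", "C"), "value2"] = np.nan  # 'C' starts with a NaN in value2

first_c = ("2020-03-03", "C")

# Sorting by id and filling straight down lets B's last value run into C.
leaky = df.sort_index(level=1).ffill().reindex(df.index)
print(leaky.loc[first_c, "value2"])   # value carried over from 'B'

# Filling within each id group keeps the ids independent.
per_id = df.groupby(level="id").ffill()
print(per_id.loc[first_c, "value2"])  # still NaN
```

If every id is guaranteed to have a value at its first timestamp, the two approaches give the same result.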
