Can I resample (ffill) pandas DataFrame with MultiIndex
I have a dataframe that looks like this, with a MultiIndex over ('timestamp', 'id'):
value
timestamp id
2020-03-03 A 100
2020-03-03 B 222
2020-03-03 C 5000
2020-03-04 A NaN
2020-03-04 B 1
2020-03-04 C NaN
2020-03-05 A 200
2020-03-05 B NaN
2020-03-05 C NaN
2020-03-06 A NaN
2020-03-06 B 20
2020-03-06 C NaN
I want to forward fill value (timewise) so that the dataframe is populated with the most recently available data item for each id, i.e. the DataFrame becomes:
value
timestamp id
2020-03-03 A 100
2020-03-03 B 222
2020-03-03 C 5000
2020-03-04 A 100
2020-03-04 B 1
2020-03-04 C 5000
2020-03-05 A 200
2020-03-05 B 1
2020-03-05 C 5000
2020-03-06 A 200
2020-03-06 B 20
2020-03-06 C 5000
Is there an easy way to do this using a resampler?
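You can sort by the second index level, forward fill, then reindex back to the original order (this assumes every id has a value on its first timestamp, since the fill otherwise carries over from the previous id):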
df.sort_index(level=1).ffill().reindex(df.index)
value
timestamp id
2020-03-03 A 100.0
B 222.0
C 5000.0
2020-03-04 A 100.0
B 1.0
C 5000.0
2020-03-05 A 200.0
B 1.0
C 5000.0
2020-03-06 A 200.0
B 20.0
C 5000.0
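As an alternative that does not rely on sorting, the fill can be restricted to each id with a groupby on the index level. The following is a minimal sketch that rebuilds the question's frame (the variable names are chosen here for illustration) and fills within each id only:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: MultiIndex over ('timestamp', 'id').
dates = pd.to_datetime(["2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06"])
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]], names=["timestamp", "id"])
values = [100, 222, 5000, np.nan, 1, np.nan, 200, np.nan, np.nan, np.nan, 20, np.nan]
df = pd.DataFrame({"value": values}, index=index)

# Forward fill within each id only; the original row order is kept.
filled = df.groupby(level="id").ffill()
print(filled)
```

Because each group is filled on its own, an id whose first value is NaN stays NaN instead of inheriting a value from another id.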
You can also use unstack to arrange the data in a proper 2D representation for filling (column-wise) and then stack it back to the original format. This treats each column (i.e. each id) separately, as opposed to rolling data values over from one id to the next, which is what the other solution does.
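import numpy as np
import pandas as pd

a = ['2020-03-03','2020-03-04','2020-03-05', '2020-03-06']
b = ['A', 'B', 'C']
c = ['value1', 'value2']
# Name the index levels so the frame matches the ('timestamp', 'id') layout shown below
df = pd.DataFrame(data=None, index=pd.MultiIndex.from_product([a,b], names=['timestamp', 'id']), columns=c)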
df.loc[('2020-03-03', slice(None)), 'value1'] = np.array([100, 222, 5000])
df.loc[('2020-03-04', 'B'), 'value1'] = 1.0
df.loc[('2020-03-05', 'A'), 'value1'] = 200.0
df.loc[('2020-03-06', 'C'), 'value1'] = 20
df['value2'] = df['value1']
df.loc[('2020-03-03', 'C'), 'value2'] = np.nan
df
value1 value2
timestamp id
2020-03-03 A 100 100
2020-03-03 B 222 222
2020-03-03 C 5000 NaN # <- note: 'C' starts with NaN in value2
2020-03-04 A NaN NaN
2020-03-04 B 1 1
2020-03-04 C NaN NaN
2020-03-05 A 200 200
2020-03-05 B NaN NaN
2020-03-05 C NaN NaN
2020-03-06 A NaN NaN
2020-03-06 B NaN NaN
2020-03-06 C 20 20
Using df.unstack().ffill() (the equivalent of df.unstack().fillna(method='ffill'), whose method argument is deprecated in recent pandas versions) gives
value1 value2
A B C A B C
timestamp
2020-03-03 100 222 5000 100 222 NaN
2020-03-04 100 1 5000 100 1 NaN
2020-03-05 200 1 5000 200 1 NaN
2020-03-06 200 1 20 200 1 20
This can be converted back to the original format with .stack().
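The full round trip can be sketched as a small helper. The function name ffill_per_id and the tiny two-date frame are hypothetical, introduced here only for illustration. The final reindex is a precaution: depending on the pandas version, stack may drop rows that are entirely NaN.

```python
import numpy as np
import pandas as pd

def ffill_per_id(frame: pd.DataFrame, id_level: str = "id") -> pd.DataFrame:
    """Forward fill each id separately via an unstack/stack round trip."""
    wide = frame.unstack(id_level)   # one column per (value column, id)
    wide = wide.ffill()              # fill down the timestamp axis
    long = wide.stack(id_level)      # back to (timestamp, id) rows
    # stack may drop all-NaN rows, so realign to the original index
    # and column order.
    return long.reindex(index=frame.index, columns=frame.columns)

index = pd.MultiIndex.from_product(
    [["2020-03-03", "2020-03-04"], ["A", "B"]], names=["timestamp", "id"]
)
# 'B' has no observation on the first date.
df = pd.DataFrame({"value": [1.0, np.nan, np.nan, 2.0]}, index=index)

result = ffill_per_id(df)
print(result)
```

Here 'A' is carried forward to the second date, while the leading NaN of 'B' is left untouched because there is no earlier 'B' value to fill from.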
Comparing this to df.sort_index(level=1).ffill().reindex(df.index), the difference is in the last column: since 'C' starts with a NaN, the last value of 'B' (1) is rolled into the start of 'C' for 'value2'.
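The difference can be reproduced on a reduced frame. This is a sketch with only two dates and the ids 'B' and 'C', chosen here for brevity rather than taken from the answer's data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["2020-03-03", "2020-03-04"], ["B", "C"]], names=["timestamp", "id"]
)
# 'C' has no observations at all; 'B' ends with the value 1.
df = pd.DataFrame({"value2": [222.0, np.nan, 1.0, np.nan]}, index=index)

# Plain ffill over the id-sorted rows: B's last value spills into C.
leaky = df.sort_index(level=1).ffill().reindex(df.index)

# Filling each id's column separately leaves C empty.
wide = df["value2"].unstack("id").ffill()

print(leaky)
print(wide)
```

In the sorted approach the rows of 'C' follow directly after the rows of 'B', so a plain ffill cannot tell where one id ends and the next begins.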