Resampling Within a Pandas MultiIndex
I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
{'value_a': values_a, 'value_b': values_b},
index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
As is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
At this point I'm really running out of ideas. Is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
+[pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper, using_reset_index, and using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
You need the groupby() method, and you need to provide it with a pd.Grouper for each level of your MultiIndex that you wish to keep in the resulting DataFrame. You can then apply an operation of your choice.
To resample date or timestamp levels, you need to set the freq argument to the frequency of your choice. A similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
pd.Grouper(level='City'),
pd.Grouper(level='Date', freq='2D')]
).sum()
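For readers who want to run this end to end, here is a minimal, self-contained sketch. It rebuilds the question's example frame (the construction via pd.MultiIndex.from_arrays is my own choice, not part of the original answer) and applies one pd.Grouper per index level:

```python
import pandas as pd

# Rebuild the small example frame from the question (construction is illustrative).
dates = pd.date_range("2012-01-01", periods=4, freq="D")
index = pd.MultiIndex.from_arrays(
    [
        ["Georgia"] * 8 + ["Alabama"] * 8,
        ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4,
        list(dates) * 4,
    ],
    names=["State", "City", "Date"],
)
df = pd.DataFrame({"value_a": range(16), "value_b": range(10, 26)}, index=index)

# One Grouper per level: State and City are kept as-is,
# while the Date level is binned into 2-day windows.
result = df.groupby(
    [
        pd.Grouper(level="State"),
        pd.Grouper(level="City"),
        pd.Grouper(level="Date", freq="2D"),
    ]
).sum()

print(result)
```

The result has eight rows (four cities times two 2-day bins), and the groups come back sorted, so Alabama is listed before Georgia.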
The Time Series Guide in the pandas documentation describes resample() as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.
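A quick way to convince yourself of this is to run both spellings on a small single-index frame. This is only a sketch; the frame and the column name "value" are made up for illustration:

```python
import numpy as np
import pandas as pd

# A plain DatetimeIndex, no MultiIndex involved (illustrative data).
idx = pd.date_range("2012-01-01", periods=6, freq="D", name="Date")
single = pd.DataFrame({"value": np.arange(6)}, index=idx)

# Time-based resampling ...
via_resample = single.resample("2D").sum()

# ... and the equivalent time-based groupby followed by a reduction.
via_grouper = single.groupby(pd.Grouper(freq="2D")).sum()

print(via_resample)
print(via_grouper)
```

Both calls produce the same 2-day bins and the same sums here.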
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
An alternative using stack/unstack:
df.unstack(level=[0,1]).resample('2D').sum().stack(level=[2,1]).swaplevel(2,0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
This works:
df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D').sum())
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column contains strings, convert it to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
I had the same issue and was racking my brain over it for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I saw there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.
Edit: More details in the "What's New" section.
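As a rough sketch of what the level kwarg does, assuming the question's example frame (rebuilt here for self-containment): note that, used on its own, resample(level=...) bins by the chosen level and aggregates across all other levels, so State and City are collapsed rather than preserved. To keep them, combine it with a groupby as in the other answers.

```python
import pandas as pd

# Rebuild the question's example frame (construction is illustrative).
dates = pd.date_range("2012-01-01", periods=4, freq="D")
index = pd.MultiIndex.from_arrays(
    [
        ["Georgia"] * 8 + ["Alabama"] * 8,
        ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4,
        list(dates) * 4,
    ],
    names=["State", "City", "Date"],
)
df = pd.DataFrame({"value_a": range(16), "value_b": range(10, 26)}, index=index)

# Resample on the "Date" level of the MultiIndex.
# The other levels (State, City) are aggregated away.
by_date = df.resample("2D", level="Date").sum()

print(by_date)
```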
I know this question is a few years old, but I had the same problem and came to a simpler solution that requires one line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-07-01 1
2012-07-02 13
2012-07-03 1
2012-07-04 1
2012-07-05 10
2012-07-06 4
2012-07-07 47
2012-07-08 0
2012-07-09 3
2012-07-10 22
2012-07-11 3
2012-07-12 0
2012-07-13 22
2012-07-14 1
2012-07-15 2
2012-07-16 2
2012-07-17 8
2012-07-18 0
2012-07-19 1
2012-07-20 10
2012-07-21 0
2012-07-22 3
2012-07-23 0
2012-07-24 35
2012-07-25 6
2012-07-26 1
2012-07-27 0
2012-07-28 6
2012-07-29 23
2012-07-30 0
..
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2014-06-02 0
2014-06-03 1
2014-06-04 0
2014-06-05 0
2014-06-06 0
2014-06-07 0
2014-06-08 2
2014-06-09 0
2014-06-10 0
2014-06-11 0
2014-06-12 0
2014-06-13 0
2014-06-14 0
2014-06-15 0
2014-06-16 0
2014-06-17 0
2014-06-18 0
2014-06-19 0
2014-06-20 0
2014-06-21 0
2014-06-22 0
2014-06-23 0
2014-06-24 0
2014-06-25 4
2014-06-26 0
2014-06-27 1
2014-06-28 0
2014-06-29 0
2014-06-30 1
2014-07-01 0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-06-25/2012-07-01 1
2012-07-02/2012-07-08 76
2012-07-09/2012-07-15 53
2012-07-16/2012-07-22 24
2012-07-23/2012-07-29 71
2012-07-30/2012-08-05 38
2012-08-06/2012-08-12 258
2012-08-13/2012-08-19 144
2012-08-20/2012-08-26 184
2012-08-27/2012-09-02 323
2012-09-03/2012-09-09 198
2012-09-10/2012-09-16 348
2012-09-17/2012-09-23 404
2012-09-24/2012-09-30 380
2012-10-01/2012-10-07 367
2012-10-08/2012-10-14 163
2012-10-15/2012-10-21 338
2012-10-22/2012-10-28 252
2012-10-29/2012-11-04 197
2012-11-05/2012-11-11 336
2012-11-12/2012-11-18 234
2012-11-19/2012-11-25 143
2012-11-26/2012-12-02 204
2012-12-03/2012-12-09 296
2012-12-10/2012-12-16 146
2012-12-17/2012-12-23 85
2012-12-24/2012-12-30 198
2012-12-31/2013-01-06 214
2013-01-07/2013-01-13 229
2013-01-14/2013-01-20 192
...
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2013-12-09/2013-12-15 3
2013-12-16/2013-12-22 0
2013-12-23/2013-12-29 0
2013-12-30/2014-01-05 1
2014-01-06/2014-01-12 3
2014-01-13/2014-01-19 6
2014-01-20/2014-01-26 11
2014-01-27/2014-02-02 0
2014-02-03/2014-02-09 1
2014-02-10/2014-02-16 4
2014-02-17/2014-02-23 3
2014-02-24/2014-03-02 1
2014-03-03/2014-03-09 4
2014-03-10/2014-03-16 0
2014-03-17/2014-03-23 0
2014-03-24/2014-03-30 9
2014-03-31/2014-04-06 1
2014-04-07/2014-04-13 1
2014-04-14/2014-04-20 1
2014-04-21/2014-04-27 2
2014-04-28/2014-05-04 8
2014-05-05/2014-05-11 7
2014-05-12/2014-05-18 5
2014-05-19/2014-05-25 2
2014-05-26/2014-06-01 8
2014-06-02/2014-06-08 3
2014-06-09/2014-06-15 0
2014-06-16/2014-06-22 0
2014-06-23/2014-06-29 5
2014-06-30/2014-07-06 1
dtype: int64
ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple of seconds on my MacBook Pro.
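In newer pandas releases the how= keyword of resample is no longer available, and the aggregation is chained as a method instead. Here is a small sketch of the same unstack/transpose idea in that style. It uses the question's example data rather than the pickle above (which is not available), and a 2-day rule instead of 'W' so the result can be compared with the other answers:

```python
import pandas as pd

# Rebuild the question's example data as a Series with a 3-level index
# (illustrative stand-in for the pickled time series).
dates = pd.date_range("2012-01-01", periods=4, freq="D")
index = pd.MultiIndex.from_arrays(
    [
        ["Georgia"] * 8 + ["Alabama"] * 8,
        ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4,
        list(dates) * 4,
    ],
    names=["State", "City", "Date"],
)
ts = pd.Series(range(16), index=index, name="value_a")

# unstack() moves Date into the columns, .T makes Date the row index,
# resample() then sees a plain DatetimeIndex; .T.stack() restores the shape.
out = ts.unstack().T.resample("2D").sum().T.stack()

print(out)
```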
I haven't checked the efficiency of this, but my instinctive way of performing datetime operations on a MultiIndex is a kind of manual "split-apply-combine" process using a dictionary comprehension.
Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows: group by the non-date columns, set "Date" as the index of each group and resample it, then reassemble the pieces using pd.concat. The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
for g, x in house.groupby(["State", "City"])})
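The snippet above depends on a frame called house that is not shown. As a self-contained sketch of the same split-apply-combine pattern, here it is applied to the question's example data. I use sum() instead of mean() so the numbers line up with the other answers, and I select the value columns explicitly so the State and City strings are not aggregated; both choices are mine:

```python
import pandas as pd

# Rebuild the question's example frame, then flatten the index
# so State, City and Date are ordinary columns.
dates = pd.date_range("2012-01-01", periods=4, freq="D")
index = pd.MultiIndex.from_arrays(
    [
        ["Georgia"] * 8 + ["Alabama"] * 8,
        ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4,
        list(dates) * 4,
    ],
    names=["State", "City", "Date"],
)
df = pd.DataFrame({"value_a": range(16), "value_b": range(10, 26)}, index=index)
flat = df.reset_index()

# Split by (State, City), resample each piece on its own DatetimeIndex,
# and collect the pieces in a dict keyed by the group tuple.
pieces = {
    key: group.set_index("Date")[["value_a", "value_b"]].resample("2D").sum()
    for key, group in flat.groupby(["State", "City"])
}

# Combine: the dict keys become the outer index levels.
combined = pd.concat(pieces, names=["State", "City"])

print(combined)
```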
I tried this on my own; it's pretty short and pretty simple too (I will only work with two index levels, but you will get the full idea):
Step 1: resample the date, which gives you the date without the other index:
new=df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()
That would give you the date and its count.
Step 2: get the categorical index in the same order as the date:
col=df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]
That would give you a new column with the city names, in the same order as the date.
Step 3: merge the DataFrames together:
new_df=pd.concat([new, col], axis=1)
It's pretty simple, and you can make it much shorter, though.