简体   繁体   English

Pandas MultiIndex 中的重采样

[英]Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:我有一些分层数据,这些数据最终变成时间序列数据,看起来像这样:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I'd like to perform time resampling per city, so something like我想对每个城市进行时间重采样,所以像

df.resample("2D", how="sum")

would output会输出

                             value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

as is, df.resample('2D', how='sum') gets me原样, df.resample('2D', how='sum')让我

TypeError: Only valid with DatetimeIndex or PeriodIndex

Fair enough, but I'd sort of expect this to work:很公平,但我有点希望这能奏效:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?在这一点上我真的没有想法了......有什么方法可以帮助我吗?

pd.Grouper allows you to specify a "groupby instruction for a target object". pd.Grouper允许您指定“目标对象的 groupby 指令”。 In particular, you can use it to group by dates even if df.index is not a DatetimeIndex :特别是,即使df.index不是DatetimeIndex ,您也可以使用它按日期分组:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. level=-1告诉pd.Grouper在 MultiIndex 的最后一级中查找日期。 Moreover, you can use this in conjunction with other level values from the index:此外,您可以将其与索引中的其他级别值结合使用:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index :看起来有点尴尬,但using_Grouper比我原来的建议using_reset_index

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates], 
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields产量

                               value_a  value_b
State   City       Date                        
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

Here is a benchmark comparing using_Grouper , using_reset_index , using_stack on a 5000-row DataFrame:这是在 5000 行 DataFrame 上比较using_Grouperusing_reset_indexusing_stack的基准测试:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop

You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame.您需要groupby()方法并为您希望在结果 DataFrame 中维护的pd.Grouper的每个级别提供一个 pd.Grouper。 You can then apply an operation of choice.然后,您可以应用选择的操作。

To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.要重新采样日期或时间戳级别,您需要使用选择的频率设置freq参数 - 不推荐使用使用pd.TimeGrouper()的类似方法,而使用带有freq参数集的pd.Grouper()

This should give you the DataFrame you need:这应该为您提供所需的 DataFrame:

df.groupby([pd.Grouper(level='State'), 
            pd.Grouper(level='City'), 
            pd.Grouper(level='Date', freq='2D')]
          ).sum()

The Time Series Guide in the pandas documentation describes resample() as: pandas 文档中的时间序列指南resample()描述为:

... a time-based groupby, followed by a reduction method on each of its groups. ...基于时间的 groupby,然后是对其每个组的归约方法。

Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.因此,从技术上讲,使用groupby()应该与在具有单个索引的 DataFrame 上使用.resample()操作相同。

The same paragraph points to the cookbook section on resampling for more advanced examples, where the ' Grouping using a MultiIndex ' entry is highly relevant for this question.同一段指向关于重采样的食谱部分以获取更高级的示例,其中“ 使用 MultiIndex 进行分组”条目与此问题高度相关。 Hope that helps.希望有帮助。

An alternative using stack/unstack使用堆栈/取消堆栈的替代方法

df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
Alabama Mobile     2012-01-01       17       37
        Montgomery 2012-01-01       25       45
Georgia Savanna    2012-01-01        9       29
        Atlanta    2012-01-03        5       25
Alabama Mobile     2012-01-03       21       41
        Montgomery 2012-01-03       29       49
Georgia Savanna    2012-01-03       13       33

Notes:笔记:

  1. No idea about performance comparison不知道性能比较
  2. Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed可能的熊猫错误 - stack(level=[2,1]) 工作,但 stack(level=[1,2]) 失败

This works:这有效:

df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

If the Date column is strings, then convert to datetime beforehand:如果 Date 列是字符串,则预先转换为 datetime:

df['Date'] = pd.to_datetime(df['Date'])

I had the same issue, was breaking my head for a while, but then I read the documentation of the .resample function in the 0.19.2 docs , and I see there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.我有同样的问题,有一段时间让我头晕目眩,但后来我阅读了0.19.2 docs中的.resample函数的文档,我看到有一个名为“level”的新kwarg可以用来指定MultiIndex 中的级别。

Edit: More details in the "What's New" section.编辑: “新增功能”部分中的更多详细信息。

I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:我知道这个问题已经有几年的历史了,但我遇到了同样的问题,并找到了一个需要 1 行的更简单的解决方案:

>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-07-01     1
                                  2012-07-02    13
                                  2012-07-03     1
                                  2012-07-04     1
                                  2012-07-05    10
                                  2012-07-06     4
                                  2012-07-07    47
                                  2012-07-08     0
                                  2012-07-09     3
                                  2012-07-10    22
                                  2012-07-11     3
                                  2012-07-12     0
                                  2012-07-13    22
                                  2012-07-14     1
                                  2012-07-15     2
                                  2012-07-16     2
                                  2012-07-17     8
                                  2012-07-18     0
                                  2012-07-19     1
                                  2012-07-20    10
                                  2012-07-21     0
                                  2012-07-22     3
                                  2012-07-23     0
                                  2012-07-24    35
                                  2012-07-25     6
                                  2012-07-26     1
                                  2012-07-27     0
                                  2012-07-28     6
                                  2012-07-29    23
                                  2012-07-30     0
                                                ..
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2014-06-02     0
                                  2014-06-03     1
                                  2014-06-04     0
                                  2014-06-05     0
                                  2014-06-06     0
                                  2014-06-07     0
                                  2014-06-08     2
                                  2014-06-09     0
                                  2014-06-10     0
                                  2014-06-11     0
                                  2014-06-12     0
                                  2014-06-13     0
                                  2014-06-14     0
                                  2014-06-15     0
                                  2014-06-16     0
                                  2014-06-17     0
                                  2014-06-18     0
                                  2014-06-19     0
                                  2014-06-20     0
                                  2014-06-21     0
                                  2014-06-22     0
                                  2014-06-23     0
                                  2014-06-24     0
                                  2014-06-25     4
                                  2014-06-26     0
                                  2014-06-27     1
                                  2014-06-28     0
                                  2014-06-29     0
                                  2014-06-30     1
                                  2014-07-01     0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-06-25/2012-07-01      1
                                  2012-07-02/2012-07-08     76
                                  2012-07-09/2012-07-15     53
                                  2012-07-16/2012-07-22     24
                                  2012-07-23/2012-07-29     71
                                  2012-07-30/2012-08-05     38
                                  2012-08-06/2012-08-12    258
                                  2012-08-13/2012-08-19    144
                                  2012-08-20/2012-08-26    184
                                  2012-08-27/2012-09-02    323
                                  2012-09-03/2012-09-09    198
                                  2012-09-10/2012-09-16    348
                                  2012-09-17/2012-09-23    404
                                  2012-09-24/2012-09-30    380
                                  2012-10-01/2012-10-07    367
                                  2012-10-08/2012-10-14    163
                                  2012-10-15/2012-10-21    338
                                  2012-10-22/2012-10-28    252
                                  2012-10-29/2012-11-04    197
                                  2012-11-05/2012-11-11    336
                                  2012-11-12/2012-11-18    234
                                  2012-11-19/2012-11-25    143
                                  2012-11-26/2012-12-02    204
                                  2012-12-03/2012-12-09    296
                                  2012-12-10/2012-12-16    146
                                  2012-12-17/2012-12-23     85
                                  2012-12-24/2012-12-30    198
                                  2012-12-31/2013-01-06    214
                                  2013-01-07/2013-01-13    229
                                  2013-01-14/2013-01-20    192
                                                          ...
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2013-12-09/2013-12-15      3
                                  2013-12-16/2013-12-22      0
                                  2013-12-23/2013-12-29      0
                                  2013-12-30/2014-01-05      1
                                  2014-01-06/2014-01-12      3
                                  2014-01-13/2014-01-19      6
                                  2014-01-20/2014-01-26     11
                                  2014-01-27/2014-02-02      0
                                  2014-02-03/2014-02-09      1
                                  2014-02-10/2014-02-16      4
                                  2014-02-17/2014-02-23      3
                                  2014-02-24/2014-03-02      1
                                  2014-03-03/2014-03-09      4
                                  2014-03-10/2014-03-16      0
                                  2014-03-17/2014-03-23      0
                                  2014-03-24/2014-03-30      9
                                  2014-03-31/2014-04-06      1
                                  2014-04-07/2014-04-13      1
                                  2014-04-14/2014-04-20      1
                                  2014-04-21/2014-04-27      2
                                  2014-04-28/2014-05-04      8
                                  2014-05-05/2014-05-11      7
                                  2014-05-12/2014-05-18      5
                                  2014-05-19/2014-05-25      2
                                  2014-05-26/2014-06-01      8
                                  2014-06-02/2014-06-08      3
                                  2014-06-09/2014-06-15      0
                                  2014-06-16/2014-06-22      0
                                  2014-06-23/2014-06-29      5
                                  2014-06-30/2014-07-06      1
dtype: int64

ts.unstack().T.resample('W', how='sum').T.stack() is all it took! ts.unstack().T.resample('W', how='sum').T.stack()就是这样! Very easy and seems quite performant.非常简单,而且看起来非常高效。 The pickle I'm reading in is 331M, so this is a pretty beefy data structure;我正在阅读的泡菜是 331M,所以这是一个非常强大的数据结构; the resampling takes just a couple seconds on my MacBook Pro.在我的 MacBook Pro 上重新采样只需几秒钟。

I haven't checked the efficiency of this, but my instinctual way of performing datetime operations on a multi-index was by a kind of manual "split-apply-combine" process using a dictionary comprehension.我还没有检查过它的效率,但是我对多索引执行日期时间操作的本能方式是通过一种使用字典理解的手动“拆分-应用-组合”过程。

Assuming your DataFrame is unindexed.假设您的 DataFrame 未编制索引。 (You can do .reset_index() first), this works as follows: (您可以先执行.reset_index() ),其工作原理如下:

  1. Group by the non-date columns按非日期列分组
  2. Set "Date" as index and resample each chunk将“日期”设置为索引并对每个块重新采样
  3. Reassemble using pd.concat使用pd.concat重新组装

The final code looks like:最终代码如下所示:

pd.concat({g: x.set_index("Date").resample("2D").mean()
                   for g, x in house.groupby(["State", "City"])})

I have tried this on my own, pretty short and pretty simple too (I will only work with 2 indexes, and you would get the full idea):我自己试过这个,很短也很简单(我只会使用 2 个索引,你会得到完整的想法):

Step 1 : resample the date but that would give you the date without the other index :第 1 步:重新采样日期,但这会给你没有其他索引的日期:

new=df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()

That would give you the date and its count这会给你日期和它的计数

Step 2 : get the categorical index in the same order as the the date :第 2 步:以与日期相同的顺序获取分类索引:

col=df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]

That would give you a new column with the city names and in the same order as the date.这将为您提供一个包含城市名称且与日期顺序相同的新列。

Step 3 : merge the dataframes together第 3 步:将数据框合并在一起

new_df=pd.concat([new, col], axis=1)

It's pretty simple, you can make it really shorter tho.这很简单,你可以让它变得更短。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM