How to resample a Pandas dataframe of mixed type?

I generate a mixed-type (floats and strings) Pandas DataFrame df3 with the following Python code:

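import numpy as np
import pandas as pd

# `dates` is a DatetimeIndex whose definition is not shown in the question.
df1 = pd.DataFrame(np.random.randn(dates.shape[0], 2), index=dates, columns=list('AB'))
df1['C'] = 'A'
df1['D'] = 'Pickles'
df2 = pd.DataFrame(np.random.randn(dates.shape[0], 2), index=dates, columns=list('AB'))
df2['C'] = 'B'
df2['D'] = 'Ham'
# Stack the two groups on top of each other; the index now contains each date twice.
df3 = pd.concat([df1, df2], axis=0)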

When I resample df3 to a higher frequency, the new rows are not filled in: the how argument appears to be ignored and I just get missing values:

df4 = df3.groupby(['C']).resample('M',  how={'A': 'mean', 'B': 'mean',  'D': 'ffill'})
df4.head()

Result:

                      B          A        D
C                                          
A 2014-03-31 -0.4640906 -0.2435414  Pickles
  2014-04-30        NaN        NaN      NaN
  2014-05-31        NaN        NaN      NaN
  2014-06-30 -0.5626360  0.6679614  Pickles
  2014-07-31        NaN        NaN      NaN

When I resample df3 to a lower frequency I don't get any resampling at all:

df5 = df3.groupby(['C']).resample('A',  how={'A': np.mean, 'B': np.mean,  'D': 'ffill'})
df5.head()

Result:

                      B          A        D
C                                          
A 2014-03-31        NaN        NaN  Pickles
  2014-06-30        NaN        NaN  Pickles
  2014-09-30        NaN        NaN  Pickles
  2014-12-31 -0.7429617 -0.1065645  Pickles
  2015-03-31        NaN        NaN  Pickles

I'm pretty sure that this has something to do with the mixed types, because if I redo the annual down-sampling with just the numerical columns everything works as expected:

df5b = df3[['A', 'B', 'C']].groupby(['C']).resample('A',  how={'A': np.mean, 'B': np.mean})
df5b.head()

Result:

                      B          A
C                                 
A 2014-12-31 -0.7429617 -0.1065645
  2015-12-31 -0.6245030 -0.3101057
B 2014-12-31  0.4213621 -0.0708263
  2015-12-31 -0.0607028  0.0110456

But even when I switch to numerical types only, resampling to a higher frequency still doesn't work as I expected:

df4b = df3[['A', 'B', 'C']].groupby(['C']).resample('M',  how={'A': 'mean', 'B': 'mean'})
df4b.head()

Result:

                      B          A
C                                 
A 2014-03-31 -0.4640906 -0.2435414
  2014-04-30        NaN        NaN
  2014-05-31        NaN        NaN
  2014-06-30 -0.5626360  0.6679614
  2014-07-31        NaN        NaN

Which leaves me with two questions:

  1. What is the proper way to resample a dataframe of mixed type?
  2. When resampling from a lower frequency to a higher frequency, what is the proper way to do the resampling so that the new values are interpolated?

Even if you can't provide a full answer to both parts, a partial solution or an answer to either question is appreciated.

When resampling from a lower frequency to a higher frequency, I realized that I was specifying how when I wanted to specify fill_method. When I do so, things seem to work:

df4c = df3.groupby(['C']).resample('M',  fill_method='ffill')
df4c.head()
                     A          B        D
C                                          
A 2014-03-31 -0.2435414 -0.4640906  Pickles
  2014-04-30 -0.2435414 -0.4640906  Pickles
  2014-05-31 -0.2435414 -0.4640906  Pickles
  2014-06-30  0.6679614 -0.5626360  Pickles
  2014-07-31  0.6679614 -0.5626360  Pickles

You get a much more limited set of interpolation choices, but it does handle the mixed types.
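
On current pandas, where fill_method is no longer accepted, one way to get real interpolation for the numeric columns while still carrying the string columns along is to upsample each group with asfreq() and then fill each kind of column separately. The following is only a minimal sketch under that assumption: the data, the helper name upsample_monthly, and the use of offset objects (pd.offsets.MonthEnd() instead of the 'M' alias, whose spelling differs between pandas versions) are illustrative choices, not part of the original post.

```python
import pandas as pd

# Hypothetical quarterly data: one numeric column (A), a group label (C)
# and a string column (D), loosely mirroring df3 from the question.
quarter_ends = pd.date_range("2014-03-31", periods=3, freq=pd.offsets.QuarterEnd())
frames = []
for label, food, values in [("A", "Pickles", [1.0, 4.0, 7.0]),
                            ("B", "Ham", [10.0, 40.0, 70.0])]:
    frames.append(pd.DataFrame({"A": values, "C": label, "D": food}, index=quarter_ends))
df = pd.concat(frames)


def upsample_monthly(group: pd.DataFrame) -> pd.DataFrame:
    """Upsample one group to month ends; interpolate numbers, forward-fill the rest."""
    # asfreq() inserts the new month-end rows and leaves them as NaN.
    monthly = group.resample(pd.offsets.MonthEnd()).asfreq()
    numeric_cols = monthly.select_dtypes(include="number").columns
    other_cols = monthly.columns.difference(numeric_cols)
    # Linear interpolation only makes sense for the numeric columns.
    monthly[numeric_cols] = monthly[numeric_cols].interpolate(method="linear")
    # String columns simply repeat the last known value.
    monthly[other_cols] = monthly[other_cols].ffill()
    return monthly


# Process each group on its own so values never leak from one group into another.
upsampled = pd.concat([upsample_monthly(g) for _, g in df.groupby("C")])
print(upsampled)
```

Splitting the columns by dtype keeps the choice of interpolation method open for the numbers (any method accepted by interpolate could be swapped in), while the strings fall back to the forward fill used above.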

When resampling to a lower frequency without the how option (I believe it defaults to mean), the down-sampling does work:

df5c = df3.groupby(['C']).resample('A')
df5c.head()
                      A          B
C                                 
A 2014-12-31 -0.1065645 -0.7429617
  2015-12-31 -0.3101057 -0.6245030
B 2014-12-31 -0.0708263  0.4213621
  2015-12-31  0.0110456 -0.0607028

Therefore the problem seems to be with passing a dictionary of how options, or with one of the option choices (presumably ffill), but I'm not sure.
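
Newer pandas versions no longer apply an implicit mean: resample returns a Resampler object and the aggregation has to be spelled out. A hedged sketch of the same per-group annual down-sampling follows. It uses pd.Grouper as one possible spelling and selects the numeric column explicitly, since averaging a string column may raise an error in recent releases. The data is made up for illustration, and pd.offsets.YearEnd() is used instead of the 'A' alias because alias names differ between pandas versions.

```python
import pandas as pd

# Hypothetical quarterly data covering two years for two groups.
quarter_ends = pd.date_range("2014-03-31", periods=8, freq=pd.offsets.QuarterEnd())
df = pd.concat([
    pd.DataFrame({"A": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                  "C": "A", "D": "Pickles"}, index=quarter_ends),
    pd.DataFrame({"A": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
                  "C": "B", "D": "Ham"}, index=quarter_ends),
])

# Group by the label column and by calendar year (year-end bins on the index),
# then average only the numeric column.
yearly_mean = df.groupby(["C", pd.Grouper(freq=pd.offsets.YearEnd())])[["A"]].mean()
print(yearly_mean)
```

The result is indexed by group label and year end, which matches the shape of the df5c output above.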

Use resample and agg

Since pandas-1.0.0, the how and fill_method keywords no longer exist. Besides, the resample method now returns a Resampler object.

The solution is to define an aggregation rule using functions or function names associated with each column.

df.resample(period).agg(aggregation_rule)

More examples of aggregation rules can be found in the documentation.

Working example

Prepare test data:

import numpy as np
import pandas as pd

dates = pd.date_range("2021-02-09", "2021-04-09", freq="1D")
df1 = pd.DataFrame(np.random.randn(dates.shape[0],2), index=dates, columns=list('AB'))
df1['C'] = 'A'
df1['D'] = 'Pickles'
df2 = pd.DataFrame(np.random.randn(dates.shape[0], 2), index=dates, columns=list('AB'))
df2['C'] = 'B'
df2['D'] = 'Ham'
df3 = pd.concat([df1, df2], axis=0)
print(df3)

Output:

                   A         B  C        D
2021-02-09  2.591285  2.455686  A  Pickles
2021-02-10  0.753461 -0.072643  A  Pickles
2021-02-11 -0.351667 -0.025511  A  Pickles
2021-02-12 -0.896730  0.004512  A  Pickles
2021-02-13 -0.493139 -0.770514  A  Pickles
...              ...       ... ..      ...
2021-04-05  1.615935  1.152517  B      Ham
2021-04-06 -0.067654 -0.858186  B      Ham
2021-04-07  0.085587 -0.848542  B      Ham
2021-04-08 -0.371983  0.088441  B      Ham
2021-04-09  0.681501  0.235328  B      Ham

[120 rows x 4 columns]

Resample per month:

agg_rules = {"A": "mean", "B": "sum", "C": "first", "D": "last"}
df4 = df3.resample("M").agg(agg_rules)
print(df4)

Output:

                   A         B  C    D
2021-02-28  0.025987  3.886781  A  Ham
2021-03-31  0.081423 -5.492928  A  Ham
2021-04-30  0.239309 -3.344334  A  Ham
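
Note that the output above mixes the two groups: C takes the first label ("A") and D the last value ("Ham") within each month, because the frame is resampled as a whole. If per-group results are wanted, as in the original question, the same aggregation rules can be combined with groupby. This is a minimal sketch on made-up data, not the answer author's code; pd.offsets.MonthEnd() stands in for the "M" alias because the alias spelling differs between pandas versions.

```python
import pandas as pd

# Hypothetical data: four observations per group spread over two months.
days = pd.to_datetime(["2021-02-10", "2021-02-20", "2021-03-05", "2021-03-25"])
df = pd.concat([
    pd.DataFrame({"A": [1.0, 3.0, 5.0, 7.0], "B": [1.0, 1.0, 1.0, 1.0],
                  "C": "A", "D": "Pickles"}, index=days),
    pd.DataFrame({"A": [10.0, 30.0, 50.0, 70.0], "B": [2.0, 2.0, 2.0, 2.0],
                  "C": "B", "D": "Ham"}, index=days),
])

# One rule per column; C is the grouping key, so it needs no rule of its own.
agg_rules = {"A": "mean", "B": "sum", "D": "last"}

# Resample each group separately, then aggregate column by column.
monthly = df.groupby("C").resample(pd.offsets.MonthEnd()).agg(agg_rules)
print(monthly)
```

The result carries a two-level index (group label, month end), so each group keeps its own string value in D.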
