
Resample a pandas dataframe by an arbitrary factor

Pandas resampling is really convenient if your index is a datetime index, but I haven't found an easy way to resample by an arbitrary factor. For example, I'd like to treat each index value as an arbitrary position and resample the dataframe so that the result is 4x shorter (while being more intelligent about it than just taking every 4th data point).

This would be useful for anyone working with data on a much shorter timescale than datetimes usually cover. For example, in my case I want to resample an audio vector from 44 kHz to 11 kHz. Right now I have to use scipy's "decimate" function and then convert the result back to a dataframe (dataframe.apply didn't work because the operation changes the length of the dataframe).
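For reference, the workaround described above might look roughly like this. It is only a minimal sketch, assuming scipy is installed; `decimate_frame` and the synthetic `audio` frame are hypothetical names made up for illustration, and the default filter settings of `scipy.signal.decimate` are used as-is rather than tuned for audio.

```python
import numpy as np
import pandas as pd
from scipy import signal


def decimate_frame(df, factor):
    """Low-pass filter and downsample every column, then rebuild a DataFrame.

    Hypothetical helper: keeps every `factor`-th index label so the new
    index lines up with the decimated samples.
    """
    values = signal.decimate(df.to_numpy(dtype=float), factor, axis=0)
    return pd.DataFrame(values, index=df.index[::factor], columns=df.columns)


# Synthetic one-second, two-channel signal at 44 kHz (illustrative data only).
rate = 44_000
t = np.arange(rate) / rate
audio = pd.DataFrame({
    "left": np.sin(2 * np.pi * 440 * t),
    "right": np.sin(2 * np.pi * 220 * t),
})

audio_11k = decimate_frame(audio, 4)  # 44 kHz -> 11 kHz
```

The round trip through a NumPy array is what makes this clumsy: the filter has no notion of the index, so the index has to be thinned separately.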

Does anyone have any suggestions for how to accomplish this?

You can use a DatetimeIndex to resample high-frequency data (up to nanosecond precision; caveat: I believe this is only available in the upcoming 0.13 release). I've successfully used pandas to resample electrophysiological data in the 24 kHz range. Here's an example:

In [97]: index = date_range('1/1/2001 00:00:00', '1/1/2001 00:00:01', freq='22727N')

In [98]: index
Out[98]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-01-01 00:00:00, ..., 2001-01-01 00:00:00.999988]
Length: 44001, Freq: 22727N, Timezone: None

In [99]: s = Series(randn(index.size), index=index)

In [100]: s.head(10)
Out[100]:
2001-01-01 00:00:00          -0.820
2001-01-01 00:00:00.000022   -1.141
2001-01-01 00:00:00.000045    1.577
2001-01-01 00:00:00.000068   -1.031
2001-01-01 00:00:00.000090    0.343
2001-01-01 00:00:00.000113   -0.424
2001-01-01 00:00:00.000136   -0.753
2001-01-01 00:00:00.000159    0.411
2001-01-01 00:00:00.000181    0.238
2001-01-01 00:00:00.000204    1.048
Freq: 22727N, dtype: float64

In [101]: s.resample(s.index.freq * 4, how='mean')
Out[101]:
2001-01-01 00:00:00          -0.354
2001-01-01 00:00:00.000090   -0.106
2001-01-01 00:00:00.000181    0.245
2001-01-01 00:00:00.000272    0.568
2001-01-01 00:00:00.000363    0.047
2001-01-01 00:00:00.000454   -0.560
2001-01-01 00:00:00.000545   -0.485
2001-01-01 00:00:00.000636   -0.271
2001-01-01 00:00:00.000727   -0.457
2001-01-01 00:00:00.000818    0.078
2001-01-01 00:00:00.000909    0.394
2001-01-01 00:00:00.000999    0.185
2001-01-01 00:00:00.001090   -0.441
2001-01-01 00:00:00.001181    0.300
2001-01-01 00:00:00.001272   -0.521
...
2001-01-01 00:00:00.998715   -0.045
2001-01-01 00:00:00.998806   -0.044
2001-01-01 00:00:00.998897    0.090
2001-01-01 00:00:00.998988    0.748
2001-01-01 00:00:00.999078   -0.179
2001-01-01 00:00:00.999169    0.451
2001-01-01 00:00:00.999260   -1.041
2001-01-01 00:00:00.999351   -0.476
2001-01-01 00:00:00.999442   -0.234
2001-01-01 00:00:00.999533   -0.719
2001-01-01 00:00:00.999624   -0.606
2001-01-01 00:00:00.999715   -0.032
2001-01-01 00:00:00.999806   -0.296
2001-01-01 00:00:00.999897   -0.044
2001-01-01 00:00:00.999988   -0.951
Freq: 90908N, Length: 11001

You can pass a callable to how, which would allow you to "do something more intelligent". pandas defaults to taking the average over each period (in this case, that's the average over each bin of 4 samples, since each 90908 ns bin spans 4 × 22727 ns).
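As far as I know, later pandas releases deprecated and then removed the how= keyword; the aggregation is instead called on the object that resample returns. Here is a minimal sketch of the same idea in that style, assuming a reasonably recent pandas. The `peak` function and the variable names are made up for illustration, and the data is random.

```python
import numpy as np
import pandas as pd

# One sample every 22727 ns, roughly a 44 kHz sample period (as in the session above).
step = pd.Timedelta(22727, unit="ns")
index = pd.date_range("2001-01-01", periods=44_000, freq=step)

rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(index.size), index=index)

factor = 4

# Mean over each bin of `factor` consecutive samples.
mean_4x = s.resample(step * factor).mean()


def peak(chunk):
    """Hypothetical custom aggregation: largest absolute value in the bin."""
    return chunk.abs().max()


# Any callable that reduces a chunk to a scalar can be applied per bin.
peak_4x = s.resample(step * factor).apply(peak)
```

Multiplying the sample period by the factor is what turns "resample by 4x" into a time-based rule that resample understands.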

I have a dirty yet effective answer to propose:

First, copy your index into another column like this, assuming your dataframe is called data:

# Store each row's position (0, 1, 2, ...) in a helper column; no loop needed
data['num'] = range(len(data))

Then simply resample using pandas' boolean selection:

data_resampled = data[data['num']%frequency==0]

It might be possible to do this without copying the index column, and most probably a dedicated command exists to make this more elegant. Yet, this works.

OK, here is a perhaps more Pythonic way, in one line, for a non-datetime index:

data_resampled = data.reset_index()[data.reset_index()['index']%frequency==0]

This way you avoid the for loop, and you get an 'index' column that you can discard afterward if needed.
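For what it's worth, here is a small sketch of two positional alternatives that don't depend on the index values at all. The toy `data` frame and the names `factor`, `every_nth`, and `block_mean` are made up for illustration. Note that keeping every Nth row is exactly the plain decimation the question wanted to improve on, so the block average is included as one possible step up, not as a proper anti-aliasing filter.

```python
import numpy as np
import pandas as pd

factor = 4
data = pd.DataFrame({"x": np.arange(10, dtype=float)})  # toy data

# Row positions 0, 1, 2, ... regardless of what the index holds.
positions = np.arange(len(data))

# Keep every `factor`-th row by position.
every_nth = data.iloc[::factor]

# Same rows as the modulo filter above, written against positions.
via_modulo = data[positions % factor == 0]

# Average each block of `factor` consecutive rows instead of dropping rows.
block_mean = data.groupby(positions // factor).mean()
```

The last block may hold fewer than `factor` rows when the length isn't a multiple of the factor; here it averages just the two remaining rows.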

