
Pandas' equivalent of resample for integer index

I'm looking for a pandas equivalent of the resample method for a dataframe whose index isn't a DatetimeIndex but an array of integers, or maybe even floats.

I know that in some cases (this one, for example) the resample method can easily be replaced by a reindex and interpolation, but in other cases (I think) it can't.

For example, if I have

df = pd.DataFrame(np.random.randn(10,2))
withdates = df.set_index(pd.date_range('2012-01-01', periods=10))
# originally written as withdates.resample('5D', np.std); current pandas chains the aggregation
withdates.resample('5D').std()

this gives me

                   0         1
2012-01-01  1.184582  0.492113
2012-01-06  0.533134  0.982562

but I can't produce the same result with df and resample. So I'm looking for something that would work as

 df.resample(5, np.std)

and that would give me

          0         1
0  1.184582  0.492113
5  0.533134  0.982562

Does such a method exist? The only way I was able to get this result was by manually splitting df into smaller dataframes, applying np.std to each, and then concatenating everything back together, which I find pretty slow and not smart at all.

Cheers

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the group labels yourself. I'd use:

(df.index.to_series() / 5).astype(int)

This gives you a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...]. Then use this in a groupby.
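As a quick check of those labels, here is a minimal sketch on a 20-row frame like the one in Setup. The floor-division variant (`// 5`) is my own assumption, not something the answer uses; it should give the same labels for a non-negative integer index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

# Labels as in the answer: true division, then truncation to int.
labels = (df.index.to_series() / 5).astype(int)

# Alternative (assumption, not from the answer): integer floor division.
labels_floor = df.index.to_series() // 5

print(labels.tolist())
print((labels == labels_floor).all())
```

Every block of five consecutive rows shares one label, which is what groupby needs.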

You'll also need to specify the index for the new dataframe. I'd use:

df.index[4::5]

This takes the current index starting at the 5th position (hence the 4) and every 5th position after that. It will look like [4, 9, 14, 19]. I could've done this as df.index[::5] to get the starting positions, but I went with ending positions.
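To make the two labelling choices concrete, here is a small sketch that only looks at the index slices. It assumes the default 0..19 integer index from Setup; the variable names are mine.

```python
import pandas as pd

idx = pd.RangeIndex(20)  # same labels as the default index in Setup

ends = idx[4::5]    # last label of each group of five
starts = idx[::5]   # first label of each group of five

print(list(ends))
print(list(starts))
```

Either slice has exactly one label per group of five, so either can be handed to set_index on the grouped result.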

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

Looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390

Other considerations

This is the equivalent of downsampling. We haven't addressed upsampling.

To go back from what we've produced to a dataframe indexed by something more frequent, we can use reindex like so:

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

Looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390

We could also reindex by something else, like range(0, 20, 2), to upsample to even integer indices.
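One caveat on the even-index case, offered as a sketch rather than as part of the original answer: the labels 9 and 19 are not in range(0, 20, 2), so reindexing first and back-filling afterwards drops those two groups. Passing method='bfill' to reindex fills from the original labels instead. The seed is arbitrary and only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only to make the sketch repeatable
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])
s = (df.index.to_series() / 5).astype(int)
df_down = df.groupby(s).std().set_index(s.index[4::5])

even = range(0, 20, 2)

# Reindex first, fill afterwards: rows 9 and 19 are discarded before bfill runs,
# so labels 6..12 pick up the value from 14, and labels 16 and 18 stay NaN.
naive = df_down.reindex(even).bfill()

# Fill while reindexing: each new label takes the next df_down label at or after it.
filled = df_down.reindex(even, method='bfill')

print(naive)
print(filled)
```

The method='bfill' form needs the index of df_down to be sorted, which it is here.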

Alternatively, this is one thing that can be done:

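import numpy as np
import pandas as pd

def resample(df, rule, how=None, **kwargs):
    # Default to the mean when no aggregation function is given.
    if how is None:
        how = np.mean

    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        # Time-based index: defer to pandas' own resample. Current pandas no longer
        # accepts the function as a second argument, so aggregate afterwards.
        return df.resample(rule, **kwargs).agg(how)
    else:
        # Integer index (assumed sorted): cut it into half-open bins of width `rule`.
        # The stop value leaves room for a final, possibly partial, bin.
        idx, bins = pd.cut(df.index, range(df.index[0], df.index[-1] + rule + 1, rule), right=False, retbins=True)
        # observed=False keeps empty bins, so the row count matches bins[:-1].
        aux = df.groupby(idx, observed=False).agg(how)
        # Label each bin by its left edge.
        aux = aux.set_index(bins[:-1])
        return aux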
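To see what the binning branch does on its own, here is a stripped-down sketch on a 10-row frame. The explicit edge list, observed=True, and the relabelling by each interval's left edge are my choices for illustration, not part of the function above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only to make the sketch repeatable
df = pd.DataFrame(np.random.rand(10, 2), columns=['A', 'B'])

edges = [0, 5, 10]
# right=False makes the bins half-open on the right: [0, 5) and [5, 10).
bins = pd.cut(df.index, edges, right=False)

out = df.groupby(bins, observed=True).std()
# Replace the interval labels with their left edges.
out.index = [interval.left for interval in out.index]

# Cross-check against plain integer-division labels.
check = df.groupby(df.index // 5).std()

print(out)
```

For a contiguous integer index both routes agree; pd.cut earns its keep when the bin edges are not simple multiples of the bin width.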

@piSquared's solution is really nice, but I don't like picking the index by hand when reindexing.

This should also work for any kind of downsampling (float indexes included) and automatically uses the mean of the index values in each range as the new label:

df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'

s = (df.index.to_series() / 10).astype(int)

Now you can pick whichever function you want to calculate in each subgroup:

# calculate std() in each group
df.groupby(s).std().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781

EDIT: There were some errors in how s was indexed; it is now correct and working.
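The same idea can be wrapped in a small helper. This is only a sketch: the name downsample and its parameters are hypothetical, and it assumes a non-negative numeric index, since astype(int) truncates toward zero and would merge the bins on either side of 0. The toy data is hand-picked so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

def downsample(df, width, func="mean"):
    """Group rows into bins of `width` along the index and aggregate each bin.

    Hypothetical helper; each bin is labelled by the mean of its index values.
    """
    pos = df.index.to_numpy()
    labels = (pos / width).astype(int)            # bin number for every row
    out = df.groupby(labels).agg(func)
    centers = pd.Series(pos).groupby(labels).mean()
    out.index = pd.Index(centers.to_numpy(), name=df.index.name)
    return out

toy = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0, 6.0, 5.0, 9.0]},
    index=pd.Index([1.0, 3.0, 8.0, 11.0, 15.0, 22.0, 28.0], name="crazy_index"),
)

result = downsample(toy, 10, "mean")
print(result)
```

Bins that contain no rows simply do not appear in the result, because the labels come from the data rather than from a fixed grid.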
