
Pandas Q-cut: Binning Data using an Expanding Window Approach

This question is somewhat similar to a 2018 question I found on the same topic.

I am hoping that if I ask it in a simpler way, someone will be able to work out a simple fix for the issue I am currently facing:

I have a timeseries dataframe named "df", which is roughly structured as follows:

            V_1   V_2  V_3  V_4
1/1/2000    17    77   15   88
1/2/2000    85    78    6   59
1/3/2000    31    9    49   16
1/4/2000    81    55   28   33
1/5/2000    8     82   82   4
1/6/2000    89    87   57   62
1/7/2000    50    60   54   49
1/8/2000    65    84   29   26
1/9/2000    12    57   53   84
1/10/2000   6     27   70   56
1/11/2000   61    6    38   38
1/12/2000   22    8    82   58
1/13/2000   17    86   65   42
1/14/2000   9     27   42   86
1/15/2000   63    78   18   35
1/16/2000   73    13   51   61
1/17/2000   70    64   75   83

If I wanted to bin each column into quintiles using all the observations in that column, I would follow this approach:

quantiles =  df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

The output looks like this:

           V_1  V_2 V_3 V_4
2000-01-01  1   3   0   4
2000-01-02  4   3   0   3
2000-01-03  2   0   2   0
2000-01-04  4   1   0   0
2000-01-05  0   4   4   0
2000-01-06  4   4   3   3
2000-01-07  2   2   3   2
2000-01-08  3   4   1   0
2000-01-09  0   2   2   4
2000-01-10  0   1   4   2
2000-01-11  2   0   1   1
2000-01-12  1   0   4   2
2000-01-13  1   4   3   1
2000-01-14  0   1   1   4
2000-01-15  3   3   0   1
2000-01-16  4   0   2   3
2000-01-17  3   2   4   4

What I want to do:

I would like to produce quantiles of the data in "df" using only observations that occurred at or before a specific point in time. I do not want to include observations that occurred after that point.

For instance:

  • To calculate the bins for the 2nd of January 2000, I would like to use only observations from the 1st and 2nd of January 2000, and nothing after those dates;
  • To calculate the bins for the 3rd of January 2000, I would like to use only observations from the 1st, 2nd and 3rd of January 2000, and nothing after those dates;
  • To calculate the bins for the 4th of January 2000, I would like to use only observations from the 1st, 2nd, 3rd and 4th of January 2000, and nothing after those dates;
  • To calculate the bins for the 5th of January 2000, I would like to use only observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000, and nothing after those dates.

Put differently, I would like to use this approach to calculate the bins for all the data points in "df", that is, from the 1st of January 2000 to the 17th of January 2000.

In short, what I want to do is conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.

The code block below is wrong, but it illustrates exactly what I am trying to accomplish:

quantiles =  df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

Does anyone have any ideas on how to do this in a simpler fashion than this?

I am new, so take this with a grain of salt, but once broken down, I believe your question is a duplicate: it only requires simple datetime index slicing, which is answered here.

lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)

print(lt_jan_5)

            V_1  V_2  V_3  V_4
2000-01-01    1    2    1    4
2000-01-02    4    3    0    3
2000-01-03    2    0    3    1
2000-01-04    3    1    2    2
2000-01-05    0    4    4    0
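
If it helps, here is a minimal sketch of how the same idea could be applied to every date at once. It assumes that what you want for each date is the bin of that day's value within the data seen so far. `last_bin`, `expanding_qcut` and the `min_periods=2` default are names and choices of mine, not pandas built-ins. As far as I know, the function passed to `expanding().apply` has to return a single number per window, whereas `pd.qcut` returns one label per observation, so the sketch keeps only the last label.

```python
import pandas as pd


def last_bin(window, q=5):
    """Bin of the newest observation, computed only from the data in `window`."""
    # With raw=True, `window` is a NumPy array holding rows 0..t of one column.
    codes = pd.qcut(window, q, duplicates="drop", labels=False)
    return codes[-1]


def expanding_qcut(frame, q=5, min_periods=2):
    """Expanding-window q-cut, column by column (hypothetical helper).

    min_periods=2 is my own assumption: a single observation has no
    spread to cut on, so the first row is left as NaN.
    """
    return frame.expanding(min_periods=min_periods).apply(
        lambda w: last_bin(w, q), raw=True
    )


# First five rows of the question's data, used here as a small check.
sample = pd.DataFrame(
    {
        "V_1": [17, 85, 31, 81, 8],
        "V_2": [77, 78, 9, 55, 82],
        "V_3": [15, 6, 49, 28, 82],
        "V_4": [88, 59, 16, 33, 4],
    },
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
)

result = expanding_qcut(sample)
print(result)
```

On these five rows, the row for 2000-01-05 comes out as 0, 4, 4, 0, the same as the last row of lt_jan_5 above, because on that date the expanding window and the slice cover the same data. The row for 2000-01-01 is NaN because of min_periods=2. The results come back as floats, so cast them once the NaN rows are handled if you need integer labels.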

Hope this is helpful.
