Pandas Q-cut: Binning Data using an Expanding Window Approach
This question is somewhat similar to a 2018 question I found on the same topic.
I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix for the issue I am currently facing:
I have a timeseries dataframe named "df", which is roughly structured as follows:
V_1 V_2 V_3 V_4
1/1/2000 17 77 15 88
1/2/2000 85 78 6 59
1/3/2000 31 9 49 16
1/4/2000 81 55 28 33
1/5/2000 8 82 82 4
1/6/2000 89 87 57 62
1/7/2000 50 60 54 49
1/8/2000 65 84 29 26
1/9/2000 12 57 53 84
1/10/2000 6 27 70 56
1/11/2000 61 6 38 38
1/12/2000 22 8 82 58
1/13/2000 17 86 65 42
1/14/2000 9 27 42 86
1/15/2000 63 78 18 35
1/16/2000 73 13 51 61
1/17/2000 70 64 75 83
If I wanted to use all the columns to produce daily quantiles, I would follow this approach:
quantiles = df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
The output looks like this:
V_1 V_2 V_3 V_4
2000-01-01 1 3 0 4
2000-01-02 4 3 0 3
2000-01-03 2 0 2 0
2000-01-04 4 1 0 0
2000-01-05 0 4 4 0
2000-01-06 4 4 3 3
2000-01-07 2 2 3 2
2000-01-08 3 4 1 0
2000-01-09 0 2 2 4
2000-01-10 0 1 4 2
2000-01-11 2 0 1 1
2000-01-12 1 0 4 2
2000-01-13 1 4 3 1
2000-01-14 0 1 1 4
2000-01-15 3 3 0 1
2000-01-16 4 0 2 3
2000-01-17 3 2 4 4
What I want to do:
I would like to produce quantiles of the data in "df" using only observations that occurred at or before a specific point in time. I do not want to include observations that occurred after that point in time.
For instance:
Put otherwise, I would like to use this approach to calculate the bins for ALL the datapoints in "df". That is, to calculate bins for every date from the 1st of January 2000 to the 17th of January 2000.
In short, what I want to do is to conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.
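To make the look-ahead concern concrete, here is a small sketch using only the V_1 column from the sample data above. It compares the bin assigned to 2000-01-04 when the cut points come from the first five days with the bin assigned when the cut points come from the full sample. The variable names (v1, known_then, full_sample) are just for illustration.

```python
import pandas as pd

# V_1 from the sample data in the question
v1 = pd.Series(
    [17, 85, 31, 81, 8, 89, 50, 65, 12, 6, 61, 22, 17, 9, 63, 73, 70],
    index=pd.date_range("2000-01-01", periods=17, freq="D"),
    name="V_1",
)

# Bins computed only from data available up to 2000-01-05
known_then = pd.qcut(v1.loc[:"2000-01-05"], 5, labels=False, duplicates="drop")
# Bins computed from the full sample, including later observations
full_sample = pd.qcut(v1, 5, labels=False, duplicates="drop")

day = pd.Timestamp("2000-01-04")
print(known_then.loc[day], full_sample.loc[day])  # prints: 3 4
```

The same observation (81) lands in a different bin once later data is allowed to move the quantile edges, which is the effect an expanding window is meant to rule out.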
The code block below is wrong, but it illustrates exactly what I am trying to accomplish:
quantiles = df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
Does anyone have any ideas of how to do this in a simpler fashion than this?
I am new, so take this with a grain of salt, but when broken down I believe your question is a duplicate, because it requires simple datetime index slicing, answered HERE.
lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
print(lt_jan_5)
V_1 V_2 V_3 V_4
2000-01-01 1 2 1 4
2000-01-02 4 3 0 3
2000-01-03 2 0 3 1
2000-01-04 3 1 2 2
2000-01-05 0 4 4 0
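The slice above handles a single cutoff date. To get the expanding-window version the question asks for, one option (a rough sketch, not a built-in pandas feature as far as I know) is to repeat that slice for every date and keep only the label of the newest row each time. The function name expanding_qcut and the min_periods argument are my own inventions, and I am assuming at least q rows of history are wanted before any label is assigned. Only the first two columns of the sample data are used, to keep the example short.

```python
import pandas as pd


def expanding_qcut(df, q=5, min_periods=None):
    """Bin each row using only the rows up to and including it."""
    if min_periods is None:
        min_periods = q  # assumption: want at least q observations before binning
    rows = {}
    for i, ts in enumerate(df.index):
        if i + 1 < min_periods:
            continue  # not enough history yet: this date stays NaN
        window = df.iloc[: i + 1]  # data up to and including ts
        binned = window.apply(
            lambda col: pd.qcut(col, q, labels=False, duplicates="drop")
        )
        rows[ts] = binned.iloc[-1]  # keep only the label of the newest row
    out = pd.DataFrame(rows).T
    return out.reindex(index=df.index, columns=df.columns)


# First two columns of the sample data from the question
df = pd.DataFrame(
    {
        "V_1": [17, 85, 31, 81, 8, 89, 50, 65, 12, 6, 61, 22, 17, 9, 63, 73, 70],
        "V_2": [77, 78, 9, 55, 82, 87, 60, 84, 57, 27, 6, 8, 86, 27, 78, 13, 64],
    },
    index=pd.date_range("2000-01-01", periods=17, freq="D"),
)

quantiles = expanding_qcut(df)
print(quantiles)
```

This calls pd.qcut once per date and per column, so it gets slow on long histories. The result is float-typed because the first min_periods - 1 rows are NaN.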
Hope this is helpful.