Applying rolling window over non-consecutive values in pandas
I need to calculate a new column for a dataframe with a given structure by applying a rolling window to values that are not positioned next to each other in the dataframe.
My dataframe is defined by something like this:
from datetime import date
import pandas as pd

df = pd.DataFrame([
{'date': date(2019,1,1), 'id': 1, 'value': 1},
{'date': date(2019,1,1), 'id': 2, 'value': 10},
{'date': date(2019,1,1), 'id': 3, 'value': 100},
{'date': date(2019,1,2), 'id': 1, 'value': 2},
{'date': date(2019,1,2), 'id': 2, 'value': 20},
{'date': date(2019,1,2), 'id': 3, 'value': 200},
{'date': date(2019,1,3), 'id': 1, 'value': 3},
{'date': date(2019,1,3), 'id': 2, 'value': 30},
{'date': date(2019,1,3), 'id': 3, 'value': 300},
{'date': date(2019,1,6), 'id': 1, 'value': 4},
{'date': date(2019,1,6), 'id': 2, 'value': 40},
{'date': date(2019,1,6), 'id': 3, 'value': 400},
])
df=df.set_index(['date', 'id'], drop=False).sort_index()
which gives a df looking like this:
                     date  id  value
date       id
2019-01-01 1   2019-01-01   1      1
           2   2019-01-01   2     10
           3   2019-01-01   3    100
2019-01-02 1   2019-01-02   1      2
           2   2019-01-02   2     20
           3   2019-01-02   3    200
2019-01-03 1   2019-01-03   1      3
           2   2019-01-03   2     30
           3   2019-01-03   3    300
2019-01-06 1   2019-01-06   1      4
           2   2019-01-06   2     40
           3   2019-01-06   3    400
I want to measure the change in column value from one (given) day to the next for each id. So for id==1 the change from 2019-01-01 to 2019-01-02 is (2-1) / 1 = 1, and from 2019-01-03 to 2019-01-06 is (4-3) / 3 = 0.333.
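The arithmetic above can be sanity-checked with a few lines of plain Python, using the id==1 values from the example frame:

```python
# Check the per-step change (new - old) / old for id == 1 (values 1, 2, 3, 4).
values = [1, 2, 3, 4]
changes = [(b - a) / a for a, b in zip(values, values[1:])]
print(changes)  # 1.0, 0.5, then 1/3 for the 2019-01-03 -> 2019-01-06 step
```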
I can calculate the desired column if I restructure the df like this so that all values are next to each other:
restructured = df.reset_index(drop=True).set_index(['date']).sort_index()
df1 = restructured.groupby('id').rolling(2).apply(lambda x: (x.max()-x.min())/x.min(), raw=False)
resulting in the desired value(s) in column value:
                  id     value
id date
1  2019-01-01    NaN       NaN
   2019-01-02    0.0  1.000000
   2019-01-03    0.0  0.500000
   2019-01-06    0.0  0.333333
2  2019-01-01    NaN       NaN
   2019-01-02    0.0  1.000000
   2019-01-03    0.0  0.500000
   2019-01-06    0.0  0.333333
3  2019-01-01    NaN       NaN
   2019-01-02    0.0  1.000000
   2019-01-03    0.0  0.500000
   2019-01-06    0.0  0.333333
How can I join/merge this column to df in the original structure, or calculate the values in another way, so that the resulting dataframe looks like this (the first df with an added column change_pct)?
                     date  id  value  change_pct
date       id
2019-01-01 1   2019-01-01   1      1         NaN
           2   2019-01-01   2     10         NaN
           3   2019-01-01   3    100         NaN
2019-01-02 1   2019-01-02   1      2    1.000000
           2   2019-01-02   2     20    1.000000
           3   2019-01-02   3    200    1.000000
2019-01-03 1   2019-01-03   1      3    0.500000
           2   2019-01-03   2     30    0.500000
           3   2019-01-03   3    300    0.500000
2019-01-06 1   2019-01-06   1      4    0.333333
           2   2019-01-06   2     40    0.333333
           3   2019-01-06   3    400    0.333333
IIUC, this might be simpler.
df['change_pct']=df.groupby('id')['value'].pct_change()
To do this, do NOT run df=df.set_index(['date', 'id'], drop=False).sort_index(). Just run the above line directly on your df.
Output
date id value change_pct
0 2019-01-01 1 1 NaN
1 2019-01-01 2 10 NaN
2 2019-01-01 3 100 NaN
3 2019-01-02 1 2 1.000000
4 2019-01-02 2 20 1.000000
5 2019-01-02 3 200 1.000000
6 2019-01-03 1 3 0.500000
7 2019-01-03 2 30 0.500000
8 2019-01-03 3 300 0.500000
9 2019-01-06 1 4 0.333333
10 2019-01-06 2 40 0.333333
11 2019-01-06 3 400 0.333333
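If the (date, id) MultiIndex is still wanted afterwards, one option (a sketch, not part of the answer itself, shown here on a reduced frame with a single id) is to compute the column first and apply the index second:

```python
from datetime import date
import pandas as pd

# Reduced example frame: only id == 1, same dates as the question.
df = pd.DataFrame([
    {'date': date(2019, 1, 1), 'id': 1, 'value': 1},
    {'date': date(2019, 1, 2), 'id': 1, 'value': 2},
    {'date': date(2019, 1, 3), 'id': 1, 'value': 3},
    {'date': date(2019, 1, 6), 'id': 1, 'value': 4},
])
# pct_change per id on the flat frame...
df['change_pct'] = df.groupby('id')['value'].pct_change()
# ...and only then re-apply the MultiIndex from the question.
df = df.set_index(['date', 'id'], drop=False).sort_index()
print(df['change_pct'].tolist())
```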
You can groupby part of the index with the level kwarg:
df['value'].groupby(level=1).rolling(2).apply(lambda x: (x.max()-x.min())/x.min(), raw=False)
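Applied to the indexed df from the question, this looks as follows (a sketch, reduced to two dates and two ids; note that groupby().rolling() prepends the group key to the result index, so droplevel(0) is needed before assigning the result back):

```python
from datetime import date
import pandas as pd

# Reduced version of the question's indexed frame.
df = pd.DataFrame([
    {'date': date(2019, 1, 1), 'id': 1, 'value': 1},
    {'date': date(2019, 1, 1), 'id': 2, 'value': 10},
    {'date': date(2019, 1, 2), 'id': 1, 'value': 2},
    {'date': date(2019, 1, 2), 'id': 2, 'value': 20},
])
df = df.set_index(['date', 'id'], drop=False).sort_index()

# Group by the 'id' level of the index (level=1) and roll within each group.
pct = df['value'].groupby(level=1).rolling(2).apply(
    lambda x: (x.max() - x.min()) / x.min(), raw=False)
# The rolling result is indexed by (id, date, id); drop the prepended
# group key so the values align back onto df's (date, id) index.
df['change_pct'] = pct.droplevel(0)
```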
The answer by SH-SF guided me to solve the problem:
The problem becomes easy if I just work on the non-indexed df:
from datetime import date
import pandas as pd

df = pd.DataFrame([
{'date': date(2019,1,1), 'id': 1, 'value': 1},
{'date': date(2019,1,1), 'id': 2, 'value': 10},
{'date': date(2019,1,1), 'id': 3, 'value': 100},
{'date': date(2019,1,2), 'id': 1, 'value': 2},
{'date': date(2019,1,2), 'id': 2, 'value': 20},
{'date': date(2019,1,2), 'id': 3, 'value': 200},
{'date': date(2019,1,3), 'id': 1, 'value': 3},
{'date': date(2019,1,3), 'id': 2, 'value': 30},
{'date': date(2019,1,3), 'id': 3, 'value': 300},
{'date': date(2019,1,6), 'id': 1, 'value': 4},
{'date': date(2019,1,6), 'id': 2, 'value': 40},
{'date': date(2019,1,6), 'id': 3, 'value': 400},
])
df=df.sort_values(['id', 'date']) # make sure everything is in correct order
window_size=2 # the window size is adjustable
# calculate values
c = df.groupby('id')['value'].rolling(window_size).apply(lambda x: (x.max()-x.min())/x.min(), raw=False)
df['change_pct'] = c.values  # create new column in df

# now I can create the structure I need
df=df.set_index(['date', 'id'], drop=False).sort_index()
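The same pattern generalizes to other window sizes. A hypothetical variation with window_size=3 (shown on a reduced single-id frame) spans three observations per id, so the first two rows of each group are NaN:

```python
from datetime import date
import pandas as pd

# Reduced example: only id == 1, same dates as the question.
df = pd.DataFrame([
    {'date': date(2019, 1, 1), 'id': 1, 'value': 1},
    {'date': date(2019, 1, 2), 'id': 1, 'value': 2},
    {'date': date(2019, 1, 3), 'id': 1, 'value': 3},
    {'date': date(2019, 1, 6), 'id': 1, 'value': 4},
])
df = df.sort_values(['id', 'date'])  # keep groups contiguous and ordered

window_size = 3  # hypothetical: three observations instead of two
c = df.groupby('id')['value'].rolling(window_size).apply(
    lambda x: (x.max() - x.min()) / x.min(), raw=False)
# .values is safe here because df is already sorted by ['id', 'date'],
# so the rolling output rows line up with the frame's row order.
df['change_pct'] = c.values
```

For id==1 this yields NaN, NaN, (3-1)/1 = 2.0, and (4-2)/2 = 1.0.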