Pandas Groupby rolling sum of multiple columns on datetime column

I am trying to get a rolling sum of multiple columns by group, rolling on a datetime column (i.e. over a specified time interval). Rolling over one column works fine, but when I roll over multiple columns at once (vectorized), I get unexpected results.

My first attempt:

import pandas as pd

df = pd.DataFrame({"column1": range(6),
                   "column2": range(6),
                   'group': 3*['A','B'],
                   'date': pd.date_range("20190101", periods=6)})

(df.groupby('group').rolling("1d", on='date')['column1'].sum()).groupby('group').shift(fill_value=0)

# output:
group  date      
A      2019-01-01    0.0
       2019-01-03    0.0
       2019-01-05    2.0
B      2019-01-02    0.0
       2019-01-04    1.0
       2019-01-06    3.0
Name: column1, dtype: float64

The above produces the desired result, but the original index is lost in the process. Since some dates in my data are identical, I would have to join back onto the original DataFrame on group + date, which is inefficient. To avoid this and keep the original index, I applied the following:

df.groupby('group').apply(lambda x: x.rolling("1d", on='date')['column1'].sum().shift(fill_value=0))

# output:
group   
A      0    0.0
       2    0.0
       4    2.0
B      1    0.0
       3    1.0
       5    3.0
Name: column1, dtype: float64

With this I can easily assign the result to a new column of the original df after sorting on the index. Now I would like to do the same for 'column2' in a single vectorized call. However, the result I get is unexpected:

df.groupby('group').apply(lambda x: x.rolling("1d", on='date')[['column1','column2']].sum().shift(fill_value=0))

# output:

   column1  column2       date
0      0.0      0.0 1970-01-01
1      0.0      0.0 1970-01-01
2      0.0      0.0 2019-01-01
3      1.0      1.0 2019-01-02
4      2.0      2.0 2019-01-03
5      3.0      3.0 2019-01-04

The summed values are correct, but the output is unexpected for two reasons: (1) group_keys in the groupby is ignored, so the group is not added to the index; (2) the result is automatically sorted back into the original row order with the original index, as a 'transform' method would do.

I would like to understand why this happens, and whether there are alternative ways to achieve the results above.
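One likely explanation, offered as an assumption rather than a confirmed diagnosis: when the function passed to apply returns a DataFrame whose index is identical to the group's index, pandas versions before 2.0 treated the call as transform-like, skipped the group keys and restored the original row order. From pandas 2.0 onwards, as far as I know, group_keys is respected in that case as well. The sketch below checks the like-indexed condition and then normalises the result so it looks the same either way. The names cols and rolled_sum are illustrative only, and "1D" is written in upper case purely to avoid alias warnings in newer pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})
cols = ["column1", "column2"]  # illustrative name


def rolled_sum(g):
    # Rolling sum over one group's rows; keep only the value columns.
    return g.rolling("1D", on="date")[cols].sum()[cols]


# The per-group result carries exactly the same index labels as the group.
group_a = df[df["group"] == "A"]
same_index = rolled_sum(group_a).index.equals(group_a.index)

res = df.groupby("group", group_keys=True)[["date"] + cols].apply(rolled_sum)

# Depending on the pandas version, the group level may or may not be present.
if res.index.nlevels > 1:
    res = res.droplevel(0)
res = res.sort_index()
```

After this normalisation res is indexed by the original row labels regardless of whether the installed pandas version prepended the group keys.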

I took your original approach and made some changes. Can you check whether this is what you wanted?

Reset the index of the original data frame so that the original index becomes a column named index.

df = df.reset_index().rename(columns={df.index.name: 'index'})

Now you have the same data frame as before, but with an additional column called index that holds the original index.

Apply the rolling sum to column1 and column2 on the data frame grouped by the group and index columns.

(df.groupby(['group', 'index']).rolling("1d", on='date')[['column1', 'column2']].sum()).groupby('group').shift(fill_value=0)

Result:

                        column1  column2
group index date                        
A     0     2019-01-01      0.0      0.0
      2     2019-01-03      0.0      0.0
      4     2019-01-05      2.0      2.0
B     1     2019-01-02      0.0      0.0
      3     2019-01-04      1.0      1.0
      5     2019-01-06      3.0      3.0

If you want the original index back, reset the MultiIndex and set the index column as the index:

(df.groupby(['group', 'index']).rolling("1d", on='date')[['column1', 'column2']].sum()).groupby('group').shift(fill_value=0).reset_index().set_index('index')

Result:

      group       date  column1  column2
index                                   
0         A 2019-01-01      0.0      0.0
2         A 2019-01-03      0.0      0.0
4         A 2019-01-05      2.0      2.0
1         B 2019-01-02      0.0      0.0
3         B 2019-01-04      1.0      1.0
5         B 2019-01-06      3.0      3.0

Add .sort_index() if you want it sorted:

      group       date  column1  column2
index                                   
0         A 2019-01-01      0.0      0.0
1         B 2019-01-02      0.0      0.0
2         A 2019-01-03      0.0      0.0
3         B 2019-01-04      1.0      1.0
4         A 2019-01-05      2.0      2.0
5         B 2019-01-06      3.0      3.0
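As a possible alternative, here is a minimal sketch that keeps the original index without adding it to the grouping keys. It assumes the goal is the shifted one-day rolling sum per group for both columns. It selects the value columns before shifting, because shifting the whole frame also shifts the date column (which is where the 1970-01-01 values in the question's output come from). The names value_cols, previous_window_sum and the _prev suffix are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "column2": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})
value_cols = ["column1", "column2"]  # illustrative name


def previous_window_sum(g):
    # Roll over this group's rows only; original index labels are preserved.
    rolled = g.rolling("1D", on="date")[value_cols].sum()
    # Keep only the value columns so that "date" is not shifted as well.
    return rolled[value_cols].shift(fill_value=0)


sums = (
    df.groupby("group", group_keys=False)[["date"] + value_cols]
      .apply(previous_window_sum)
      .sort_index()
)

# join aligns on the index, so no merge on group + date is needed.
result = df.join(sums.add_suffix("_prev"))
```

Because the rows are grouped by group alone, the window can span several rows of a group when dates are closer together or the window is wider.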

Hope this helps. Let me know if I am missing anything.
