Pandas rolling_max with a variable window size specified in a df column

Question

I'd like to calculate a rolling_max of a pandas column, where the window size varies and equals the difference between the current row index and the row where a certain condition was last met.

So, as an example, I have:

df = pd.DataFrame({'a': [0,1,0,0,0,1,0,0,0,0,1,0],
                   'b': [5,4,3,6,1,2,3,4,2,1,7,8]})

I want a rolling_max of df.b since the last time df.a == 1. I.e., I want to get this:

     a   b   rm
 0   0   5   NaN  <- no previous a==1
 1   1   4   4    <- a==1
 2   0   3   4
 3   0   6   6
 4   0   1   6
 5   1   2   2    <- a==1
 6   0   3   3
 7   0   4   4
 8   0   2   4
 9   0   1   4
10   1   7   7    <- a==1
11   0   8   8

My df has an integer index without gaps, so I tried to do this:

df['last_a'] = np.where(df.a == 1, df.index, np.nan)
df['last_a'].fillna(method='ffill', inplace=True)
df['rm'] = pd.rolling_max(df['b'], window = df.index - df['last_a'] + 1)

but I'm getting a TypeError: an integer is required.

This is part of a long script operating on quite a big data frame, so I need the fastest solution possible. I have successfully done this with a loop instead of rolling_max, but it's very slow. Could you please help?

Just for reference, the ugly and long loop that I have now, which, regardless of its ugliness, seems to be quite fast on my data frame (50,000 x 25 for a test), is as follows:

df['rm2'] = df.b
df['rm1'] = np.where( (df['a'] == 1) | (df['rm2'].diff() > 0), df['rm2'], np.nan)
df['rm1'].fillna(method = 'ffill', inplace = True)
df['Dif'] = (df['rm1'] - df['rm2']).abs()
while df['Dif'].sum() != 0:
    df['rm2'] = df['rm1']
    df['rm1'] = np.where( (df['a'] == 1) | (df['rm2'].diff() > 0), df['rm2'], np.nan) 
    df['rm1'].fillna(method = 'ffill', inplace = True)
    df['Dif'] = (df['rm1'] - df['rm2']).abs()

Answer 1

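I would create an index and group by it to use cummax:

import numpy as np

# Each a == 1 starts a new group; rows before the first a == 1 fall in group 0
df['index'] = df['a'].cumsum()
# Running maximum of b within each group
df['rm']    = df.groupby('index')['b'].cummax()

# Group 0 has no preceding a == 1, so its result is left undefined
df.loc[df['index']==0, 'rm'] = np.nan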

In [104]: df
Out[104]:
    a  b  index  rm
0   0  5      0 NaN
1   1  4      1   4
2   0  3      1   4
3   0  6      1   6
4   0  1      1   6
5   1  2      2   2
6   0  3      2   3
7   0  4      2   4
8   0  2      2   4
9   0  1      2   4
10  1  7      3   7
11  0  8      3   8
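The same idea can be sketched as a self-contained script that leaves no helper column in the frame. This is only an illustration under assumptions: the function name max_since_last_flag is hypothetical, and it assumes the flag column marks the reset rows with exactly 1, as in the question's example.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   'b': [5, 4, 3, 6, 1, 2, 3, 4, 2, 1, 7, 8]})

def max_since_last_flag(values, flag):
    """Running max of `values`, restarted on every row where `flag` == 1.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Counting the flags seen so far gives one group id per stretch of rows
    group_id = flag.eq(1).cumsum()
    # Cumulative maximum within each stretch
    running_max = values.groupby(group_id).cummax()
    # Rows before the first flag (group 0) have no window, so they become NaN
    return running_max.where(group_id > 0)

df['rm'] = max_since_last_flag(df['b'], df['a'])
print(df)
```

Because cumsum, groupby and cummax are all vectorised, this avoids the Python-level loop from the question.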

Answer 2

Indeed, anytime you need to restructure data in a way that involves relationships between columns and tables, consider an SQL solution using a relational database management system (RDBMS). Do so especially if your data derives from a database. Leave pandas for data analysis. Of course, if you are storing large data outside a database, then that's a whole other issue!

Python comes equipped with a built-in library for SQLite, the popular free, open-source file-level database. Additionally, Python libraries for MySQL, SQL Server, PostgreSQL, Oracle, and other RDBMSs are available for install. You can integrate each connection seamlessly into pandas. Below are three equivalent versions of queries to achieve your conditional group max. Each assumes you maintain an autonumber primary key index, ID, in your source table, named here as RollingMax.

import sqlite3 as lite
import pandas as pd

con = lite.connect('C:\\Path\\SQLite\\DB.db')

# SQL WITH DERIVED TABLES
sql = """SELECT a, b,
               (SELECT Max(dtbl2.B) 
               FROM 
                   (SELECT t1.ID, t1.a, t1.b,
                          (SELECT Count(*) FROM RollingMax t2 
                           WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
                    FROM RollingMax t1) dtbl2
               WHERE dtbl1.ID >= dtbl2.ID 
               AND dtbl1.GrpA = dtbl2.GrpA) As rm

         FROM 
         (
              SELECT t1.ID, t1.a, t1.b,
                     (SELECT Count(*) FROM RollingMax t2 
              WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
              FROM RollingMax t1
         ) As dtbl1;"""

# SQL USING A CTE (COMMON TABLE EXPRESSION; AVAILABLE AS OF SQLITE VERSION 3.8.3)
sql = """WITH grp (ID, a, b, GrpA)
         AS  (
              SELECT t1.ID, t1.a, t1.b,
                    (SELECT Count(*) FROM RollingMax t2 
                     WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
              FROM RollingMax t1
             )
         SELECT a, b,
               (SELECT Max(dtbl2.B) 
                FROM grp AS dtbl2
                WHERE dtbl1.ID >= dtbl2.ID 
                AND dtbl1.GrpA = dtbl2.GrpA) As rm
         FROM grp AS dtbl1;"""

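# SQL USING SAVED VIEW
# The query below is to be saved inside the database as a view named
# saved_view (e.g. with CREATE VIEW) before the final query is run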
saved_view = """SELECT t1.ID, t1.a, t1.b,
                  (SELECT Count(*) FROM RollingMax t2 
                   WHERE t1.ID >= t2.ID AND t2.A > 0) As GrpA
                FROM RollingMax t1;"""

sql = """SELECT a, b,
             (SELECT Max(dtbl2.B) 
              FROM saved_view AS dtbl2
              WHERE dtbl1.ID >= dtbl2.ID 
              AND dtbl1.GrpA = dtbl2.GrpA) As rm
         FROM saved_view As dtbl1;"""

df = pd.read_sql(sql, con)

OUTPUT (the only challenge here is the first grouping without a preceding a==1)
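As written, the queries above return the running max of b for the rows before the first a==1 rather than a missing value. One possible way to handle that first grouping is sketched below on the question's example data. The in-memory database, loading the table from a data frame with to_sql, and the CASE branch that returns NULL for group 0 are assumptions added for illustration and are not part of the original queries.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   'b': [5, 4, 3, 6, 1, 2, 3, 4, 2, 1, 7, 8]})

# Assumption: an in-memory database stands in for the file-based one
con = sqlite3.connect(':memory:')

# The data frame index plays the role of the autonumber key ID
df.rename_axis('ID').reset_index().to_sql('RollingMax', con, index=False)

sql = """
WITH grp AS (
    SELECT t1.ID, t1.a, t1.b,
           (SELECT COUNT(*) FROM RollingMax t2
            WHERE t2.ID <= t1.ID AND t2.a > 0) AS GrpA
    FROM RollingMax t1
)
SELECT g1.a, g1.b,
       -- Added for illustration: group 0 has no preceding a == 1
       CASE WHEN g1.GrpA = 0 THEN NULL
            ELSE (SELECT MAX(g2.b) FROM grp g2
                  WHERE g2.ID <= g1.ID AND g2.GrpA = g1.GrpA)
       END AS rm
FROM grp g1
ORDER BY g1.ID;
"""

result = pd.read_sql(sql, con)
con.close()
print(result)
```

The correlated subqueries rescan the table for every row, so on a large table this is likely to be much slower than a vectorised pandas solution; it is mainly of interest when the data already lives in the database.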

Answer 1: 2, accepted, 2015-12-02 14:01:41

Answer 2: 0, 2015-12-03 00:09:24
