
Speed up selective cumulative sum based on another column in a group

I have a data frame where I want to group by two columns and then create a new column containing the cumulative sum of a third column, where a row is only counted when a fourth column has a particular value. I have code that works, but it is incredibly slow. How do I speed it up?

So in the example below I want the cumulative sum of Qty, counting only rows where dir equals up, computed per date and sym.

In R with data.table this would be a simple one-liner that executes very quickly:

d1[,newColName:=cumsum(Qty*(dir=="up")),by=c("date","sym")]

What I came up with in Python and pandas is a really slow (but working) function, used as follows:

def test(x):
    # Build a 0/1 mask from "dir", multiply it element-wise with "Qty",
    # then take the cumulative sum -- all in plain Python loops, hence slow.
    mask = [1 if y == "up" else 0 for y in x["dir"]]
    return pd.Series([a * b for a, b in zip(mask, x["Qty"].tolist())]).cumsum()

# example use
d1[1:20].groupby(["date", "sym"])[["dir", "Qty"]].apply(test)  # too slow to run over the whole data set

An example chunk of the data:

d1[["date","sym","dir","Qty" ]]
Out[102]: 
             date sym   dir  Qty
0      2019-10-29  A1    up    9
1      2019-10-29  A1  down    1
2      2019-10-29  A1  down   11
3      2019-10-29  A1    up    2
4      2019-10-29  A1    up    3

How do I speed this up so that I can actually run it over a substantial amount of data in Python? It does not have to be pandas, by the way, but it should be Python.

So here is the output I am looking to get:

> d1
             date sym  dir Qty newColName
    1: 2019-10-29  A1   up   9          9
    2: 2019-10-29  A1 down   1          9
    3: 2019-10-29  A1 down  11          9
    4: 2019-10-29  A1   up   2         11
    5: 2019-10-29  A1   up   3         14

You can try the following:

>>> df.loc[df.dir.eq('up'), 'newColName'] = df[df.dir.eq('up')].groupby(['date', 'sym'])['Qty'].cumsum()
>>> df['newColName'] = df['newColName'].ffill(downcast='infer')
>>> df
         date sym   dir  Qty  newColName
0  2019-10-29  A1    up    9           9
1  2019-10-29  A1  down    1           9
2  2019-10-29  A1  down   11           9
3  2019-10-29  A1    up    2          11
4  2019-10-29  A1    up    3          14

Or, using the same approach with reindex:

>>> df['newColName'] = (df[df.dir.eq('up')]
         .groupby(['date', 'sym'])['Qty']
         .cumsum().reindex(df.index)
         .ffill(downcast='infer')
    )
>>> df
         date sym   dir  Qty  newColName
0  2019-10-29  A1    up    9           9
1  2019-10-29  A1  down    1           9
2  2019-10-29  A1  down   11           9
3  2019-10-29  A1    up    2          11
4  2019-10-29  A1    up    3          14
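As a further alternative (a sketch, not part of the original answer), you can translate the data.table expression cumsum(Qty*(dir=="up")) almost literally: zero out the quantities on non-up rows with Series.where, then take a grouped cumulative sum. This fills every row in one pass and needs no ffill at all:

import pandas as pd

df = pd.DataFrame({
    'date': ['2019-10-29'] * 5,
    'sym':  ['A1'] * 5,
    'dir':  ['up', 'down', 'down', 'up', 'up'],
    'Qty':  [9, 1, 11, 2, 3],
})

# Qty * (dir == "up"): replace quantities on non-"up" rows with 0,
# then take the cumulative sum within each (date, sym) group.
df['newColName'] = (
    df['Qty'].where(df['dir'].eq('up'), 0)
      .groupby([df['date'], df['sym']])
      .cumsum()
)

On the example data this yields the same newColName column: 9, 9, 9, 11, 14.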
