I have a data frame that I want to group by 2 columns, and then create a new column holding the cumulative sum of a 3rd column, where a row only counts toward the sum depending on the value of a fourth column. I have code that works, but it is incredibly slow. How do I speed it up?
So in the example below I want the cumulative sum of Qty where dir is equal to up, by date and sym.
In R with data.table this would be a simple one-liner that finishes execution very fast:
d1[,newColName:=cumsum(Qty*(dir=="up")),by=c("date","sym")]
What I came up with in Python and pandas is a really slow (but working) function, used as follows:
def test(x):
    # multiply each Qty by 1 if dir is "up", else 0, then take the running total
    return pd.Series([ a*b for a,b in zip([ 1 if y == "up" else 0 for y in x["dir"] ], x["Qty"].tolist()) ]).cumsum()
# example use
d1[1:20].groupby(["date","sym"])[["dir","Qty"]].apply(test) # too slow to run over the whole data set
An example chunk of the data:
d1[["date","sym","dir","Qty" ]]
Out[102]:
date sym dir Qty
0 2019-10-29 A1 up 9
1 2019-10-29 A1 down 1
2 2019-10-29 A1 down 11
3 2019-10-29 A1 up 2
4 2019-10-29 A1 up 3
How do I speed this up so that I can actually run it over a substantial amount of data in Python? It does not have to be pandas, by the way, but it should be Python.
So here is the output I am looking to get:
> d1
date sym dir Qty newColName
1: 2019-10-29 A1 up 9 9
2: 2019-10-29 A1 down 1 9
3: 2019-10-29 A1 down 11 9
4: 2019-10-29 A1 up 2 11
5: 2019-10-29 A1 up 3 14
You can try the following:
>>> df.loc[df.dir.eq('up'), 'newColName'] = df[df.dir.eq('up')].groupby(['date', 'sym'])['Qty'].cumsum()
>>> df['newColName'] = df['newColName'].ffill(downcast='infer')
>>> df
date sym dir Qty newColName
0 2019-10-29 A1 up 9 9
1 2019-10-29 A1 down 1 9
2 2019-10-29 A1 down 11 9
3 2019-10-29 A1 up 2 11
4 2019-10-29 A1 up 3 14
Or, using the same thing with reindex:
>>> df['newColName'] = (df[df.dir.eq('up')]
...                     .groupby(['date', 'sym'])['Qty']
...                     .cumsum().reindex(df.index)
...                     .ffill(downcast='infer')
...                     )
>>> df
date sym dir Qty newColName
0 2019-10-29 A1 up 9 9
1 2019-10-29 A1 down 1 9
2 2019-10-29 A1 down 11 9
3 2019-10-29 A1 up 2 11
4 2019-10-29 A1 up 3 14
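Another option, closer in spirit to the data.table one-liner in the question, is to zero out Qty on the non-up rows first and then take a grouped cumulative sum. This is only a sketch under a few assumptions: the second group (sym B2) is a made-up addition to the sample data, included to show what happens when a group starts with a down row. Because the forward fill in the snippets above is not grouped, it may carry a value across a group boundary or leave a NaN in that situation, whereas the masked version yields 0 there. It also avoids the downcast argument, which recent pandas releases flag as deprecated.

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical second group (sym "B2")
# that starts with a "down" row, to exercise the group boundary.
df = pd.DataFrame({
    "date": ["2019-10-29"] * 7,
    "sym": ["A1"] * 5 + ["B2"] * 2,
    "dir": ["up", "down", "down", "up", "up", "down", "up"],
    "Qty": [9, 1, 11, 2, 3, 4, 5],
})

# Keep Qty where dir is "up" and replace it with 0 everywhere else,
# mirroring Qty*(dir=="up") in the data.table version.
masked_qty = df["Qty"].where(df["dir"].eq("up"), 0)

# Running total within each (date, sym) group; the result is aligned
# to the original index, so it can be assigned straight back.
df["newColName"] = masked_qty.groupby([df["date"], df["sym"]]).cumsum()

print(df)
```

For the five A1 rows this gives 9, 9, 9, 11, 14, matching the desired output in the question. Everything here is vectorized, so it should scale much better than a Python-level apply, though I have not benchmarked it on a large data set.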