Sorry if the title is a bit confusing; I'm struggling to find the words to describe this specific pandas challenge. The question is best illustrated with the example below. I have a dataframe with three columns:
import pandas as pd

q = pd.DataFrame({'OrderID': ['a1', 'a1', 'a1', 'a2', 'a3'],
                  'Execution_Size': [20, 75, 500, 200, 1000],
                  'Quote_Size': [300, 300, 300, 500, 600]})
It looks like this:

  OrderID  Execution_Size  Quote_Size
0      a1              20         300
1      a1              75         300
2      a1             500         300
3      a2             200         500
4      a3            1000         600
Rows are grouped by OrderID, and each unique OrderID has a single Quote_Size. Execution_Size is the column I'm trying to modify. I've already done several preprocessing steps to guarantee that each group either ends at its last row or at the first row whose cumulative Execution_Size exceeds the Quote_Size. For instance, in the first group the first two rows add up to 95 < 300 (the quote size), while the last row pushes the cumulative sum above 300.
What I want, for each group of rows sharing an OrderID, is to cap the last row's Execution_Size so that the group's total never exceeds the Quote_Size. Using the example above, the third row of Execution_Size in group 'a1' should become 300 - (20 + 75) = 205. In group 'a3', the single row has 1000 > 600, so the result is 600.
Thanks for your time in advance. Any advice is appreciated.
Here's a solution (in several steps, for clarity):
import pandas as pd

df = pd.DataFrame({'OrderID': ['a1', 'a1', 'a1', 'a2', 'a3'],
                   'Execution_Size': [20, 75, 500, 200, 1000],
                   'Quote_Size': [300, 300, 300, 500, 600]})

# total executed size per order
df["total_execution"] = df.groupby("OrderID")["Execution_Size"].transform("sum")

# index of the last row in each group
last_items = df.groupby("OrderID").apply(pd.DataFrame.last_valid_index).values

# on each last row, subtract the overshoot (total_execution - Quote_Size)
df.loc[last_items, "calculated_exec_size"] = df.loc[last_items, "Execution_Size"]
df.loc[last_items, "calculated_exec_size"] -= (df.loc[last_items, "total_execution"] -
                                               df.loc[last_items, "Quote_Size"])

# keep the smaller of the original and the capped size
df.loc[last_items, "Execution_Size"] = (
    df.loc[last_items, ["Execution_Size", "calculated_exec_size"]].min(axis=1))
The result is:

  OrderID  Execution_Size  Quote_Size  total_execution  calculated_exec_size
0      a1            20.0         300              595                   NaN
1      a1            75.0         300              595                   NaN
2      a1           205.0         300              595                 205.0
3      a2           200.0         500              200                 500.0
4      a3           600.0         600             1000                 600.0
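As a sketch of another way to express the same capping rule (column names `prior` and `remaining` are my own): instead of locating the last row of each group, cap every fill at whatever quote remains before it. Under the question's precondition (only the last row of a group can push the cumulative sum past the quote), this produces the same result.

```python
import pandas as pd

df = pd.DataFrame({'OrderID': ['a1', 'a1', 'a1', 'a2', 'a3'],
                   'Execution_Size': [20, 75, 500, 200, 1000],
                   'Quote_Size': [300, 300, 300, 500, 600]})

# cumulative size of the EARLIER rows in each group (cumsum minus the current row)
prior = (df.groupby('OrderID')['Execution_Size'].cumsum()
         - df['Execution_Size'])

# quote left over before each row, floored at zero
remaining = (df['Quote_Size'] - prior).clip(lower=0)

# each execution is capped at the remaining quote
df['Execution_Size'] = df['Execution_Size'].clip(upper=remaining)
```

For the example data this yields 20, 75, 205, 200, 600, matching the step-by-step result above.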
You can try an approach with np.select (comments inline):
import numpy as np

g = q.groupby('OrderID')['Execution_Size']  # create the group
s = g.cumsum()  # group-wise cumulative sum of Execution_Size

# create our conditions below
c1 = s.gt(q['Quote_Size'])  # cumulative sum exceeds Quote_Size
c2 = g.transform('count').gt(1)  # group has more than one row
# Execution_Size > Quote_Size for single-row groups
c3 = q['Execution_Size'].gt(q['Quote_Size'])

# cumulative sum of the earlier rows in each group
# (like a group-wise s.shift() filled with 0, but cannot leak across groups)
prior = s.sub(q['Execution_Size'])

# combine the conditions with the required values
arr = np.select([c1 & c2, c3 & ~c2],
                [q['Quote_Size'].sub(prior), q['Quote_Size']],
                default=q['Execution_Size'])
q['New_Execution_Size'] = arr
print(q)
  OrderID  Execution_Size  Quote_Size  New_Execution_Size
0      a1              20         300                20.0
1      a1              75         300                75.0
2      a1             500         300               205.0
3      a2             200         500               200.0
4      a3            1000         600               600.0
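If np.select is unfamiliar: it checks a list of boolean conditions in order, picks the value from the matching choice list, and falls back to `default` when nothing matches. A minimal illustration (the arrays here are made up for the demo):

```python
import numpy as np

x = np.array([1, 5, 10])

# conditions are evaluated in order: the first match wins
out = np.select([x > 8, x > 3],       # conditions
                ['big', 'mid'],       # corresponding choices
                default='small')      # used when no condition matches
# 1 matches neither -> 'small'; 5 matches x > 3 -> 'mid'; 10 matches x > 8 -> 'big'
```

That first-match-wins ordering is why the answer above lists the multi-row condition `c1 & c2` before the single-row condition `c3 & ~c2`.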