
Pandas: Modify the value of last cell in each group based on how the groupby sum result compares to the value in another column

Sorry if the title is a bit confusing; I find it hard to describe this kind of pandas problem in words. The question is best illustrated with the example below.

I have a dataframe with 3 fields:

import pandas as pd

q =pd.DataFrame({'OrderID':['a1', 'a1','a1','a2', 'a3'], 
              'Execution_Size': [20, 75, 500, 200, 1000],
              'Quote_Size': [300, 300, 300, 500, 600] })

It looks like this:

  OrderID  Execution_Size  Quote_Size
0      a1              20         300
1      a1              75         300
2      a1             500         300
3      a2             200         500
4      a3            1000         600

Rows are grouped by OrderID, and each unique OrderID has exactly one Quote_Size. The column I'm trying to modify is the second one, Execution_Size.

I have already preprocessed the data so that each group ends either at its last row or at the row that pushes the cumulative sum of Execution_Size above Quote_Size. In other words, only the last row of a group can take the cumulative sum over the quote size.

For instance, in the first group the first two rows add up to 95 < 300 (the quote size), while the last row pushes the cumulative sum above 300.

What I want is:

  OrderID  Execution_Size  Quote_Size
0      a1              20         300
1      a1              75         300
2      a1             205         300
3      a2             200         500
4      a3             600         600

Basically for each group of rows with the same OrderID:

  • if there are multiple rows and the sum of the group's execution sizes is bigger than the quote size, set the execution size in the last row of the group to quote_size - (sum of all rows in the group except the last)

Using the example here, the third row of Execution_Size shall be 300 - (20 + 75) = 205

  • if the sum of the group's execution sizes is smaller than the quote size, nothing needs to be done, regardless of the number of rows.
  • if there is only one execution size for an OrderID and it's bigger than the quote size, change the value to the quote size.

Here the last row: 1000 > 600. Therefore the result is 600.
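
To make the three rules concrete, here is a minimal pure-Python sketch of the expected behavior for a single group. The helper name `cap_last_execution` is hypothetical (it is not a pandas function), and it assumes, as stated above, that only the last row of a group can push the total over the quote size.

```python
def cap_last_execution(sizes, quote):
    """Apply the rules above to one group's execution sizes (hypothetical helper)."""
    sizes = list(sizes)
    if sum(sizes) > quote:
        # Cap the last row so that the group total equals the quote size.
        # For a single-row group the sum of the other rows is 0, so the
        # value simply becomes the quote size.
        sizes[-1] = quote - sum(sizes[:-1])
    return sizes

print(cap_last_execution([20, 75, 500], 300))  # [20, 75, 205]
print(cap_last_execution([200], 500))          # [200]
print(cap_last_execution([1000], 600))         # [600]
```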

Thanks for your time in advance. Any advice is appreciated.

Here's a solution (in several steps, for clarity):

import pandas as pd

df = pd.DataFrame({'OrderID': ['a1', 'a1', 'a1', 'a2', 'a3'],
                   'Execution_Size': [20, 75, 500, 200, 1000],
                   'Quote_Size': [300, 300, 300, 500, 600]})
# total execution size per order, repeated on every row of the group
df["total_execution"] = df.groupby("OrderID")["Execution_Size"].transform("sum")
# index labels of the last row of each group
last_items = df.groupby("OrderID").tail(1).index
# for the last rows: execution size minus the overshoot (total - quote)
df.loc[last_items, "calculated_exec_size"] = df.loc[last_items, "Execution_Size"]
df.loc[last_items, "calculated_exec_size"] -= (df.loc[last_items, "total_execution"] -
                                               df.loc[last_items, "Quote_Size"])
# keep the smaller of the original and the adjusted value, so groups under the quote are unchanged
df.loc[last_items, "Execution_Size"] = df.loc[last_items, ["Execution_Size", "calculated_exec_size"]].min(axis=1)

The result is:

  OrderID  Execution_Size  Quote_Size  total_execution  calculated_exec_size
0      a1            20.0         300              595                   NaN
1      a1            75.0         300              595                   NaN
2      a1           205.0         300              595                 205.0
3      a2           200.0         500              200                 500.0
4      a3           600.0         600             1000                 600.0

You can try an approach with np.select (comments inline):

import numpy as np

g = q.groupby('OrderID')['Execution_Size']  # group execution sizes by order
s = g.cumsum()  # group-wise cumulative sum of Execution_Size
# cumulative sum of the earlier rows in the same group (0 for a group's first row);
# a plain s.shift() would carry the previous group's total across the group boundary
prev = s.sub(q['Execution_Size'])
# conditions
c1 = s.gt(q['Quote_Size'])  # cumulative sum exceeds Quote_Size
c2 = g.transform('count').gt(1)  # group has more than one row
c3 = q['Execution_Size'].gt(q['Quote_Size'])  # Execution_Size > Quote_Size (used for single-row groups)
# pair each condition with its replacement value
arr = np.select([c1 & c2, c3 & ~c2],
                [q['Quote_Size'].sub(prev), q['Quote_Size']],
                default=q['Execution_Size'])
q['New_Execution_Size'] = arr
print(q)

  OrderID  Execution_Size  Quote_Size  New_Execution_Size
0      a1              20         300                  20
1      a1              75         300                  75
2      a1             500         300                 205
3      a2             200         500                 200
4      a3            1000         600                 600
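
As a cross-check on the answers above, the same result can also be reached by capping each row at whatever part of the quote is still unfilled before it. This is only a sketch of an alternative, not the approach of either answer; the column name `Capped_Execution_Size` is made up for illustration. It relies on the question's premise that only the last row of a group can overshoot the quote. If rows existed after the overshoot, this version would set them to 0, which the question does not ask for.

```python
import pandas as pd

q = pd.DataFrame({'OrderID': ['a1', 'a1', 'a1', 'a2', 'a3'],
                  'Execution_Size': [20, 75, 500, 200, 1000],
                  'Quote_Size': [300, 300, 300, 500, 600]})

# Cumulative execution size within each order, excluding the current row.
prior = q.groupby('OrderID')['Execution_Size'].cumsum() - q['Execution_Size']
# Part of the quote still unfilled before each row (never negative).
remaining = (q['Quote_Size'] - prior).clip(lower=0)
# Each row executes at most what remains of the quote.
q['Capped_Execution_Size'] = q['Execution_Size'].clip(upper=remaining)

print(q['Capped_Execution_Size'].tolist())  # [20, 75, 205, 200, 600]
```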
