简体   繁体   中英

different aggregated sums of the same column based on boolean values in another column

I have a dataframe accounting different LEGO pieces contained in each of my LEGO set boxes. For each set box, there are always many different regular pieces, but somemtimes the box contains also some additional spare pieces. So the dataframe has a boolean column to distinguish that condition.

Now I want to summarize the dataset so I get just one row per LEGO set (groupby set_id) with a new column for the total amount of pieces in that set box (aggregated sum of "quantity").

My problem is that I also want two additional columns for accounting how many of those pieces are "regular" and how many are "spare", based on the True/False column.

Is there any way of calculating those three sum columns by creating just one additional dataframe and just one.agg() call ?

Instead of creating 3 dataframes and merging columns, which is my current approach:

import pandas as pd
import random
random.seed(1)

# creating sample data:
nrows=15
df = pd.DataFrame([], columns=["set_id","part_id","quantity","is_spare"])
df["set_id"]=["ABC"[random.randint(0,2)] for r in range(0,nrows)]
df["part_id"] = [random.randint(1000,8000) for n in range(0,nrows)]
df["quantity"] = [random.randint(1,10) for n in range(0,nrows)]
df["is_spare"]=[random.random()>0.75 for r in range(0,nrows)]
print(df)

# grouping into a new dfsummary dataframe: HOW TO DO IT IN JUST ONE STEP ?

# aggregate sum of ALL pieces:
dfsummary = df.groupby("set_id", as_index=False) \
  .agg(num_pieces=("quantity","sum"))

# aggregate sum of "normal" pieces:
dfsummary2 = df.loc[df["is_spare"]==False].groupby("set_id", as_index=False) \
  .agg(normal_pieces=("quantity","sum"))

# aggregate sum of "spare" pieces:
dfsummary3 = df.loc[df["is_spare"]==True].groupby("set_id", as_index=False) \
  .agg(spare_pieces=("quantity","sum"))

# Putting all aggregate columns together:
dfsummary = dfsummary \
  .merge(dfsummary2,on="set_id",how="left") \
  .merge(dfsummary3,on="set_id",how="left")

print(dfsummary)

ORIGINAL DATA:

   set_id  part_id  quantity  is_spare
0       A     4545         1     False
1       C     5976         1     False
2       A     7244         9     False
3       B     7284         1     False
4       A     1017         7     False
5       B     6700         4      True
6       B     4648         7     False
7       B     3181         1     False
8       C     6910         9     False
9       B     7568         4      True
10      A     2874         8      True
11      A     5842         8     False
12      B     1837         9     False
13      A     3600         4     False
14      B     1250         6     False

SUMMARIZED DATA:

  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29           8.0
1      B          32             24           8.0
2      C          10             10           NaN

I saw this Stackoverflow question , but my case is somehow different because the sum() functions would only be applied to some rows of the target column depending on other column's True/False values.

You can do it in one line. The trick is to create a temporary column where quantity is negative for spare_pieces and positive for normal_pieces :

out = df.assign(qty=df['is_spare'].replace({True: -1, False: 1}) * df['quantity']) \
        .groupby('set_id')['qty'] \
        .agg(num_pieces=lambda x: sum(abs(x)), 
             normal_pieces=lambda x: sum(x[x > 0]),
             sparse_pieces=lambda x: abs(sum(x[x < 0]))) \
        .reset_index()

Output:

>>> out
  set_id  num_pieces  normal_pieces  sparse_pieces
0      A          37             29              8
1      B          32             24              8
2      C          10             10              0

>>> df['is_spare'].replace({True: -1, False: 1}) * df['quantity'])
0     1  # normal_pieces
1     1
2     9
3     1
4     7
5    -4  # spare_pieces
6     7
7     1
8     9
9    -4
10   -8
11    8
12    9
13    4
14    6
dtype: int64

One option is to do a groupby and unstack:

(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)

  set_id  normal_pieces  spare_pieces  num_pieces
0      A           29.0           8.0        37.0
1      B           24.0           8.0        32.0
2      C           10.0           NaN        10.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM