Different aggregated sums of the same column based on boolean values in another column

Question

I have a dataframe that records the different LEGO pieces contained in each of my LEGO set boxes. Each set box always contains many different regular pieces, but sometimes it also contains some additional spare pieces. So the dataframe has a boolean column to distinguish between the two.

Now I want to summarize the dataset so that I get just one row per LEGO set (groupby set_id), with a new column for the total number of pieces in that set box (the aggregated sum of "quantity").

My problem is that I also want two additional columns counting how many of those pieces are "regular" and how many are "spare", based on the True/False column.

Is there any way to calculate those three sum columns by creating just one additional dataframe and making just one .agg() call?

That would replace my current approach, which creates 3 dataframes and merges their columns:

import pandas as pd
import random
random.seed(1)

# creating sample data:
nrows=15
df = pd.DataFrame([], columns=["set_id","part_id","quantity","is_spare"])
df["set_id"]=["ABC"[random.randint(0,2)] for r in range(0,nrows)]
df["part_id"] = [random.randint(1000,8000) for n in range(0,nrows)]
df["quantity"] = [random.randint(1,10) for n in range(0,nrows)]
df["is_spare"]=[random.random()>0.75 for r in range(0,nrows)]
print(df)

# grouping into a new dfsummary dataframe: HOW TO DO IT IN JUST ONE STEP ?

# aggregate sum of ALL pieces:
dfsummary = df.groupby("set_id", as_index=False) \
  .agg(num_pieces=("quantity","sum"))

# aggregate sum of "normal" pieces:
dfsummary2 = df.loc[df["is_spare"]==False].groupby("set_id", as_index=False) \
  .agg(normal_pieces=("quantity","sum"))

# aggregate sum of "spare" pieces:
dfsummary3 = df.loc[df["is_spare"]==True].groupby("set_id", as_index=False) \
  .agg(spare_pieces=("quantity","sum"))

# Putting all aggregate columns together:
dfsummary = dfsummary \
  .merge(dfsummary2,on="set_id",how="left") \
  .merge(dfsummary3,on="set_id",how="left")

print(dfsummary)

ORIGINAL DATA:

   set_id  part_id  quantity  is_spare
0       A     4545         1     False
1       C     5976         1     False
2       A     7244         9     False
3       B     7284         1     False
4       A     1017         7     False
5       B     6700         4      True
6       B     4648         7     False
7       B     3181         1     False
8       C     6910         9     False
9       B     7568         4      True
10      A     2874         8      True
11      A     5842         8     False
12      B     1837         9     False
13      A     3600         4     False
14      B     1250         6     False

SUMMARIZED DATA:

  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29           8.0
1      B          32             24           8.0
2      C          10             10           NaN

I saw this Stack Overflow question, but my case is somewhat different because the sum() functions would be applied only to some rows of the target column, depending on the True/False values in another column.

Answer 1

You can do it in one line. The trick is to create a temporary column where the quantity is negative for spare pieces and positive for normal pieces, so the sign of each value tells the aggregation which kind of piece it is:

# map() turns the boolean flag into a sign: -1 for spare, +1 for normal
out = df.assign(qty=df['is_spare'].map({True: -1, False: 1}) * df['quantity']) \
        .groupby('set_id')['qty'] \
        .agg(num_pieces=lambda x: x.abs().sum(),
             normal_pieces=lambda x: x[x > 0].sum(),
             spare_pieces=lambda x: -x[x < 0].sum()) \
        .reset_index()

Output:

>>> out
  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29             8
1      B          32             24             8
2      C          10             10             0

>>> df['is_spare'].map({True: -1, False: 1}) * df['quantity']
0     1  # normal_pieces
1     1
2     9
3     1
4     7
5    -4  # spare_pieces
6     7
7     1
8     9
9    -4
10   -8
11    8
12    9
13    4
14    6
dtype: int64
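
For comparison, here is a minimal sketch of a variant that avoids the sign trick. It is not part of the answer above. It assumes two hypothetical helper columns, normal_qty and spare_qty, built with Series.where so that each keeps the quantity only for its own kind of piece, and then uses a single named-aggregation .agg() call. The sample data is copied from the question's table.

```python
import pandas as pd

# Sample data copied from the question's ORIGINAL DATA table
df = pd.DataFrame({
    "set_id": list("ACABABBBCBAABAB"),
    "quantity": [1, 1, 9, 1, 7, 4, 7, 1, 9, 4, 8, 8, 9, 4, 6],
    "is_spare": [False, False, False, False, False, True, False, False,
                 False, True, True, False, False, False, False],
})

summary = (
    df.assign(
        # hypothetical helper columns: keep quantity where the mask holds, else 0
        normal_qty=df["quantity"].where(~df["is_spare"], 0),
        spare_qty=df["quantity"].where(df["is_spare"], 0),
    )
    .groupby("set_id", as_index=False)
    .agg(
        num_pieces=("quantity", "sum"),
        normal_pieces=("normal_qty", "sum"),
        spare_pieces=("spare_qty", "sum"),
    )
)
print(summary)
```

Because the masked-out rows contribute 0 rather than being dropped, sets without spare pieces get 0 instead of NaN, and the lambdas are replaced by the built-in "sum" aggregation.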

Answer 2

One option is to do a groupby and unstack:

(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)

  set_id  normal_pieces  spare_pieces  num_pieces
0      A           29.0           8.0        37.0
1      B           24.0           8.0        32.0
2      C           10.0           NaN        10.0
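
The NaN for set C, and the float columns that come with it, appear because set C has no spare rows to unstack. As a hedged sketch (a variation on the answer above, not the answerer's own code), passing fill_value=0 to unstack fills the missing combination with 0. It assumes that both True and False occur somewhere in is_spare, otherwise one of the renamed columns would not exist. The sample data is copied from the question's table.

```python
import pandas as pd

# Sample data copied from the question's ORIGINAL DATA table
df = pd.DataFrame({
    "set_id": list("ACABABBBCBAABAB"),
    "quantity": [1, 1, 9, 1, 7, 4, 7, 1, 9, 4, 8, 8, 9, 4, 6],
    "is_spare": [False, False, False, False, False, True, False, False,
                 False, True, True, False, False, False, False],
})

summary = (
    df.groupby(["set_id", "is_spare"])["quantity"]
    .sum()
    # fill_value=0 replaces the missing (set_id, is_spare) combinations
    .unstack("is_spare", fill_value=0)
    .rename(columns={False: "normal_pieces", True: "spare_pieces"})
    .assign(num_pieces=lambda t: t["normal_pieces"] + t["spare_pieces"])
    .rename_axis(columns=None)
    .reset_index()
)
print(summary)
```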


2 answers

Answer 1
0 2022-01-02 20:08:45

Answer 2
0 2022-01-02 20:23:02
