Different aggregated sums of the same column based on boolean values in another column

I have a dataframe that records the different LEGO pieces contained in each of my LEGO set boxes. Each set box always has many different regular pieces, but sometimes it also contains some additional spare pieces. The dataframe therefore has a boolean column to distinguish the two cases.

Now I want to summarize the dataset so that I get just one row per LEGO set (groupby set_id), with a new column for the total number of pieces in that set box (aggregated sum of "quantity").

My problem is that I also want two additional columns counting how many of those pieces are "regular" and how many are "spare", based on the True/False column.

Is there a way to calculate those three sum columns by creating just one additional dataframe with just one .agg() call?

That would replace my current approach, which creates 3 dataframes and merges their columns:

import pandas as pd
import random
random.seed(1)

# creating sample data:
nrows=15
df = pd.DataFrame([], columns=["set_id","part_id","quantity","is_spare"])
df["set_id"]=["ABC"[random.randint(0,2)] for r in range(0,nrows)]
df["part_id"] = [random.randint(1000,8000) for n in range(0,nrows)]
df["quantity"] = [random.randint(1,10) for n in range(0,nrows)]
df["is_spare"]=[random.random()>0.75 for r in range(0,nrows)]
print(df)

# grouping into a new dfsummary dataframe: HOW TO DO IT IN JUST ONE STEP ?

# aggregate sum of ALL pieces:
dfsummary = df.groupby("set_id", as_index=False) \
  .agg(num_pieces=("quantity","sum"))

# aggregate sum of "normal" pieces:
dfsummary2 = df.loc[df["is_spare"]==False].groupby("set_id", as_index=False) \
  .agg(normal_pieces=("quantity","sum"))

# aggregate sum of "spare" pieces:
dfsummary3 = df.loc[df["is_spare"]==True].groupby("set_id", as_index=False) \
  .agg(spare_pieces=("quantity","sum"))

# Putting all aggregate columns together:
dfsummary = dfsummary \
  .merge(dfsummary2,on="set_id",how="left") \
  .merge(dfsummary3,on="set_id",how="left")

print(dfsummary)

ORIGINAL DATA:

   set_id  part_id  quantity  is_spare
0       A     4545         1     False
1       C     5976         1     False
2       A     7244         9     False
3       B     7284         1     False
4       A     1017         7     False
5       B     6700         4      True
6       B     4648         7     False
7       B     3181         1     False
8       C     6910         9     False
9       B     7568         4      True
10      A     2874         8      True
11      A     5842         8     False
12      B     1837         9     False
13      A     3600         4     False
14      B     1250         6     False

SUMMARIZED DATA:

  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29           8.0
1      B          32             24           8.0
2      C          10             10           NaN

I saw this Stack Overflow question, but my case is somewhat different because the sum() functions would be applied only to some rows of the target column, depending on the other column's True/False values.

You can do it in one chained statement. The trick is to create a temporary column where quantity is negative for spare pieces and positive for normal pieces:

out = df.assign(qty=df['is_spare'].replace({True: -1, False: 1}) * df['quantity']) \
        .groupby('set_id')['qty'] \
        .agg(num_pieces=lambda x: sum(abs(x)),
             normal_pieces=lambda x: sum(x[x > 0]),
             spare_pieces=lambda x: abs(sum(x[x < 0]))) \
        .reset_index()

Output:

>>> out
  set_id  num_pieces  normal_pieces  spare_pieces
0      A          37             29             8
1      B          32             24             8
2      C          10             10             0

>>> df['is_spare'].replace({True: -1, False: 1}) * df['quantity']
0     1  # normal_pieces
1     1
2     9
3     1
4     7
5    -4  # spare_pieces
6     7
7     1
8     9
9    -4
10   -8
11    8
12    9
13    4
14    6
dtype: int64
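
A variation on the same idea, not part of the answer above, is to skip the sign trick and the lambdas: mask the quantity into two helper columns and use one named-aggregation call. This is a minimal sketch that assumes the sample rows from the question; the helper column names normal_qty and spare_qty are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question's ORIGINAL DATA table.
df = pd.DataFrame({
    "set_id":   list("ACABABBBCBAABAB"),
    "part_id":  [4545, 5976, 7244, 7284, 1017, 6700, 4648, 3181,
                 6910, 7568, 2874, 5842, 1837, 3600, 1250],
    "quantity": [1, 1, 9, 1, 7, 4, 7, 1, 9, 4, 8, 8, 9, 4, 6],
    "is_spare": [False, False, False, False, False, True, False, False,
                 False, True, True, False, False, False, False],
})

out = (
    # Hypothetical helper columns: keep the quantity where the condition
    # holds and replace it with 0 elsewhere.
    df.assign(
        normal_qty=df["quantity"].where(~df["is_spare"], 0),
        spare_qty=df["quantity"].where(df["is_spare"], 0),
    )
    .groupby("set_id", as_index=False)
    # One .agg() call producing all three sums.
    .agg(
        num_pieces=("quantity", "sum"),
        normal_pieces=("normal_qty", "sum"),
        spare_pieces=("spare_qty", "sum"),
    )
)
print(out)
```

Because the masked columns hold 0 rather than missing values, a set with no spare pieces (set C here) gets a spare_pieces sum of 0 instead of NaN.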

One option is to do a groupby and unstack:

(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)

  set_id  normal_pieces  spare_pieces  num_pieces
0      A           29.0           8.0        37.0
1      B           24.0           8.0        32.0
2      C           10.0           NaN        10.0
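
The NaN for set C appears because that set has no spare rows, so unstack has nothing to put in that cell. If zeros are preferred, one possible tweak is to pass fill_value=0 to unstack. The sketch below assumes the sample rows from the question, and also assumes that at least one row has is_spare set to True (otherwise no spare_pieces column would exist).

```python
import pandas as pd

# Sample rows copied from the question's ORIGINAL DATA table.
df = pd.DataFrame({
    "set_id":   list("ACABABBBCBAABAB"),
    "part_id":  [4545, 5976, 7244, 7284, 1017, 6700, 4648, 3181,
                 6910, 7568, 2874, 5842, 1837, 3600, 1250],
    "quantity": [1, 1, 9, 1, 7, 4, 7, 1, 9, 4, 8, 8, 9, 4, 6],
    "is_spare": [False, False, False, False, False, True, False, False,
                 False, True, True, False, False, False, False],
})

out = (
    df.groupby(["set_id", "is_spare"])["quantity"]
    .sum()
    # fill_value=0 fills the (set_id, is_spare) combinations that do not occur.
    .unstack("is_spare", fill_value=0)
    .rename(columns={False: "normal_pieces", True: "spare_pieces"})
    # Assumes both columns exist, i.e. the data has spare and normal rows.
    .assign(num_pieces=lambda d: d["normal_pieces"] + d["spare_pieces"])
    .rename_axis(columns=None)
    .reset_index()
)
print(out)
```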
