How to sum several columns conditionally with pyspark?

I'm trying to figure out a way to sum multiple columns, but with a different condition in each sum.

This is the data I have in a dataframe:

order_id    article_id  article_name         nr_of_items price       is_black is_fabric
----------- ----------- -------------------- ----------- ----------- -------- ---------
1           567         batteries            6           5           0        0
1           645         pants                1           20          1        1
2           876         tent                 1           40          0        1
2           434         socks                10          5           1        1

This is what I want:

order_id    total_order_amount black_order_amount fabric_order_amount
----------- ------------------ ------------------ -------------------
1           50                 20                 20
2           90                 50                 90

This is how it would be accomplished in SQL:

select 
    order_id, 
    sum(nr_of_items*price) as total_order_amount,
    sum(case when is_black = 1 then price*nr_of_items else 0 end) as black_order_amount,
    sum(case when is_fabric = 1 then price*nr_of_items else 0 end) as fabric_order_amount 
from order_lines
group by order_id
;

How do I do the same using pyspark? I.e., what I'm wondering is how to aggregate several columns, but with a different condition on each.

I've prepared a pyspark dataframe in case anyone wants to give it a try:

from pyspark.sql.types import *

# Note: is_black / is_fabric are booleans here (shown as 0/1 in the table above)
cSchema = StructType([StructField("order_id", IntegerType()),
                      StructField("article_id", IntegerType()),
                      StructField("article_name", StringType()),
                      StructField("nr_of_items", IntegerType()),
                      StructField("price", IntegerType()),
                      StructField("is_black", BooleanType()),
                      StructField("is_fabric", BooleanType())])

test_list = [[1, 567, 'batteries', 6, 5, False, False],
             [1, 645, 'pants', 1, 20, True, True],
             [2, 876, 'tent', 1, 40, False, True],
             [2, 434, 'socks', 10, 5, True, True]]

df = spark.createDataFrame(test_list, schema=cSchema)
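
For a quick sanity check that the dataframe matches the table above (this assumes a SparkSession is already available as spark, as in the snippet):

# Should print the four rows; is_black / is_fabric appear as true/false
# because the schema declares them as booleans rather than 0/1.
df.show()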

I'm using Spark version 2.4.4 and Python version 3.7.3.

Johanrex,

Here's a piece of code:

from pyspark.sql.functions import *

df.groupBy("order_id").agg(
    # Unconditional total: price * quantity summed over every line of the order
    sum(col("nr_of_items")*col("price")).alias("total_order_amount"),
    # Conditional sums: a line only contributes when its flag is set, otherwise 0.
    # is_black / is_fabric are BooleanType, so == lit(1) resolves through Spark's
    # boolean-equality coercion; when(col("is_black"), ...) would work as well.
    sum(when(col("is_black") == lit(1), col("price")*col("nr_of_items")).otherwise(lit(0))).alias("black_order_amount"),
    sum(when(col("is_fabric") == lit(1), col("price")*col("nr_of_items")).otherwise(lit(0))).alias("fabric_order_amount")
).limit(100).toPandas()

Output:

order_id    total_order_amount    black_order_amount    fabric_order_amount
1           50                    20                    20
2           90                    50                    90
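
If you'd rather reuse the SQL from the question verbatim, an alternative (a minimal sketch, assuming the same spark session and df as above) is to register the dataframe as a temporary view and run the query through spark.sql. Note the case when is_black then ... form: the prepared dataframe stores the flags as booleans rather than 0/1.

# Register the dataframe under the name the SQL expects.
df.createOrReplaceTempView("order_lines")

spark.sql("""
    select
        order_id,
        sum(nr_of_items * price) as total_order_amount,
        sum(case when is_black then price * nr_of_items else 0 end) as black_order_amount,
        sum(case when is_fabric then price * nr_of_items else 0 end) as fabric_order_amount
    from order_lines
    group by order_id
""").show()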
