Count distinct column values based on a condition in PySpark

Question

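I have a column with two possible values: 'users' or 'not_users'.

I want to count distinct values only for the rows where this column is 'users'.

This is the code I'm using: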

from pyspark.sql import functions as f

output = (
    df
    # Label each row with the year-week of the day before registration
    .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
    .groupby('week')
    .agg(
        f.countDistinct('customer_id').alias('count_total_users'),
        f.countDistinct('vegetables_customers').alias('count_vegetable_users'),
    )
)

display(output)

This is the output I get (not what I want):

Week        count_total_users      count_vegetable_users
2020-40            2345                        2
2020-41            5678                        2
2020-42            3345                        2
2020-43            5689                        2

Desired output:

Week        count_total_users      count_vegetable_users
2020-40            2345                        457
2020-41            5678                        1987
2020-42            3345                        2308
2020-43            5689                        4000

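In the desired output, count_vegetable_users should be a distinct count restricted to the rows whose value in that column is 'users'.

Any pointers?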

Answer 1

Is df2 the result you're looking for?

df.show()
+----+-----------+--------------------+
|week|customer_id|vegetables_customers|
+----+-----------+--------------------+
|   1|          1|               users|
|   1|          2|           not_users|
|   1|          3|               users|
|   2|          1|           not_users|
|   2|          2|           not_users|
|   2|          3|               users|
+----+-----------+--------------------+

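from pyspark.sql import functions as F

df2 = df.groupBy('week').agg(
    F.countDistinct('customer_id').alias('count_total_users'),
    # when() without otherwise() yields null for non-matching rows,
    # and countDistinct ignores nulls, so only 'users' customers are counted
    F.countDistinct(
        F.when(
            F.col('vegetables_customers') == 'users',
            F.col('customer_id')
        )
    ).alias('count_vegetable_users')
)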

df2.show()
+----+-----------------+---------------------+
|week|count_total_users|count_vegetable_users|
+----+-----------------+---------------------+
|   1|                3|                    2|
|   2|                3|                    1|
+----+-----------------+---------------------+
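For anyone wondering why the original query returned 2 for every week: counting the distinct values of the flag column can never give more than 2, because the column only holds 'users' and 'not_users'. The sketch below is a plain-Python model of the aggregation, not Spark code; it assumes the same six sample rows as df above, and count_distinct is a hypothetical helper that mimics how a distinct count skips nulls.

```python
from collections import defaultdict

# Sample rows mirroring df above: (week, customer_id, vegetables_customers)
rows = [
    (1, 1, "users"),
    (1, 2, "not_users"),
    (1, 3, "users"),
    (2, 1, "not_users"),
    (2, 2, "not_users"),
    (2, 3, "users"),
]


def count_distinct(values):
    """Hypothetical helper: count distinct values, skipping None (null)."""
    return len({v for v in values if v is not None})


# Group rows by week, like groupBy('week')
groups = defaultdict(list)
for week, customer_id, flag in rows:
    groups[week].append((customer_id, flag))

result = {}
for week, members in sorted(groups.items()):
    # Distinct customers in the week
    total = count_distinct(cid for cid, _ in members)
    # Model of when(flag == 'users', customer_id): None where the flag differs
    vegetable = count_distinct(
        cid if flag == "users" else None for cid, flag in members
    )
    # What the question's query computed: distinct values of the flag itself
    flag_values = count_distinct(flag for _, flag in members)
    result[week] = (total, vegetable, flag_values)

# Each entry is (count_total_users, count_vegetable_users, distinct flag values)
print(result)
```

The second number in each tuple matches count_vegetable_users in df2, while the third is what the question's query was effectively computing.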

Solution 1
0 Accepted 2020-12-23 13:56:56
