Count distinct column values based on condition pyspark
I have a column with 2 possible values: 'users' or 'not_users'.
What I want to do is count the distinct values only when the value is 'users'.
This is the code I'm using:
output = (df
    .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
    .groupby('week')
    .agg(f.countDistinct('customer_id').alias('count_total_users'),
         f.countDistinct('vegetables_customers').alias('count_vegetable_users'))
)
display(output)
This is the output (not what I want):
Week count_total_users count_vegetable_users
2020-40 2345 2
2020-41 5678 2
2020-42 3345 2
2020-43 5689 2
Desired output:
Week count_total_users count_vegetable_users
2020-40 2345 457
2020-41 5678 1987
2020-42 3345 2308
2020-43 5689 4000
This desired output should be the distinct count of 'users' values within the column they belong to.
Any clues?
Is df2 the result you want?
df.show()
+----+-----------+--------------------+
|week|customer_id|vegetables_customers|
+----+-----------+--------------------+
| 1| 1| users|
| 1| 2| not_users|
| 1| 3| users|
| 2| 1| not_users|
| 2| 2| not_users|
| 2| 3| users|
+----+-----------+--------------------+
df2 = df.groupBy('week').agg(
    F.countDistinct('customer_id').alias('count_total_users'),
    F.countDistinct(
        F.when(
            F.col('vegetables_customers') == 'users',
            F.col('customer_id')
        )
    ).alias('count_vegetable_users')
)
df2.show()
+----+-----------------+---------------------+
|week|count_total_users|count_vegetable_users|
+----+-----------------+---------------------+
| 1| 3| 2|
| 2| 3| 1|
+----+-----------------+---------------------+
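This works because `F.when` without an `otherwise` clause returns NULL for non-matching rows, and `countDistinct` ignores NULLs, so only `customer_id` values from `'users'` rows are counted. A minimal pure-Python sketch of the same logic, using hypothetical data that mirrors the example `df` above:

```python
from collections import defaultdict

# (week, customer_id, vegetables_customers) rows from the example
rows = [
    (1, 1, "users"), (1, 2, "not_users"), (1, 3, "users"),
    (2, 1, "not_users"), (2, 2, "not_users"), (2, 3, "users"),
]

total = defaultdict(set)  # distinct customers per week
veg = defaultdict(set)    # distinct 'users' customers per week

for week, cust, flag in rows:
    total[week].add(cust)       # every row counts toward count_total_users
    if flag == "users":         # the when() condition: skip non-matching rows
        veg[week].add(cust)     # only these feed count_vegetable_users

result = {w: (len(total[w]), len(veg[w])) for w in total}
# {1: (3, 2), 2: (3, 1)} — matches the df2.show() output above
```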