Count distinct column values based on condition pyspark
I have a column with 2 possible values: 'users' or 'not_users'.
What I want to do is count the distinct values only when the value is 'users'.
This is the code I'm using:
output = (df
    .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
    .groupby('week')
    .agg(f.countDistinct('customer_id').alias('count_total_users'),
         f.countDistinct('vegetables_customers').alias('count_vegetable_users'))
)
display(output)
This is the output (not what I want):
Week count_total_users count_vegetable_users
2020-40 2345 2
2020-41 5678 2
2020-42 3345 2
2020-43 5689 2
Desired output:
Week count_total_users count_vegetable_users
2020-40 2345 457
2020-41 5678 1987
2020-42 3345 2308
2020-43 5689 4000
This desired output should be the distinct count of 'users' values within the column they belong to.
Any clues?
Is df2 the result you want?
df.show()
+----+-----------+--------------------+
|week|customer_id|vegetables_customers|
+----+-----------+--------------------+
| 1| 1| users|
| 1| 2| not_users|
| 1| 3| users|
| 2| 1| not_users|
| 2| 2| not_users|
| 2| 3| users|
+----+-----------+--------------------+
df2 = df.groupBy('week').agg(
    F.countDistinct('customer_id').alias('count_total_users'),
    F.countDistinct(
        F.when(
            F.col('vegetables_customers') == 'users',
            F.col('customer_id')
        )
    ).alias('count_vegetable_users')
)
df2.show()
+----+-----------------+---------------------+
|week|count_total_users|count_vegetable_users|
+----+-----------------+---------------------+
| 1| 3| 2|
| 2| 3| 1|
+----+-----------------+---------------------+
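This works because `F.when` without an `otherwise` clause returns NULL for non-matching rows, and `countDistinct` ignores NULLs, so only `customer_id` values from `'users'` rows are counted. A minimal pure-Python sketch of the same logic, using hypothetical data that mirrors the example `df` above:

```python
from collections import defaultdict

# (week, customer_id, vegetables_customers) rows from the example
rows = [
    (1, 1, "users"), (1, 2, "not_users"), (1, 3, "users"),
    (2, 1, "not_users"), (2, 2, "not_users"), (2, 3, "users"),
]

total = defaultdict(set)  # distinct customers per week
veg = defaultdict(set)    # distinct 'users' customers per week

for week, cust, flag in rows:
    total[week].add(cust)       # every row counts toward count_total_users
    if flag == "users":         # the when() condition: skip non-matching rows
        veg[week].add(cust)     # only these feed count_vegetable_users

result = {w: (len(total[w]), len(veg[w])) for w in total}
# {1: (3, 2), 2: (3, 1)} — matches the df2.show() output above
```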