Count distinct column values based on condition pyspark

I have a column with 2 possible values: 'users' or 'not_users'.

What I want to do is to countDistinct values when those values are 'users'.

This is the code I'm using:

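from pyspark.sql import functions as f

output = (
    df
    # Label each row with the year-week of the day before registration_date
    .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
    .groupby('week')
    .agg(
        f.countDistinct('customer_id').alias('count_total_users'),
        # Counts the distinct values of the column itself ('users' / 'not_users'),
        # so the result can never exceed 2
        f.countDistinct('vegetables_customers').alias('count_vegetable_users'),
    )
)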

display(output)

This is the output (not desired):

Week        count_total_users      count_vegetable_users
2020-40            2345                        2
2020-41            5678                        2
2020-42            3345                        2
2020-43            5689                        2

Desired output:

Week        count_total_users      count_vegetable_users
2020-40            2345                        457
2020-41            5678                        1987
2020-42            3345                        2308
2020-43            5689                        4000

The desired output should be the distinct count restricted to the rows where the column's value is 'users'.

Any clue?

Is df2 the result that you want?

df.show()
+----+-----------+--------------------+
|week|customer_id|vegetables_customers|
+----+-----------+--------------------+
|   1|          1|               users|
|   1|          2|           not_users|
|   1|          3|               users|
|   2|          1|           not_users|
|   2|          2|           not_users|
|   2|          3|               users|
+----+-----------+--------------------+

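from pyspark.sql import functions as F

df2 = df.groupBy('week').agg(
    F.countDistinct('customer_id').alias('count_total_users'),
    # when() without otherwise() yields null for non-matching rows,
    # and countDistinct ignores nulls, so only 'users' rows are counted
    F.countDistinct(
        F.when(
            F.col('vegetables_customers') == 'users',
            F.col('customer_id')
        )
    ).alias('count_vegetable_users')
)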

df2.show()
+----+-----------------+---------------------+
|week|count_total_users|count_vegetable_users|
+----+-----------------+---------------------+
|   1|                3|                    2|
|   2|                3|                    1|
+----+-----------------+---------------------+
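
To see why this works, here is a minimal plain-Python model of the aggregation, run on the same six sample rows. It is only a sketch of the semantics, not Spark code: the helper name conditional_distinct_counts is hypothetical, and the model assumes (as Spark does) that a when() without otherwise() produces null for non-matching rows and that a distinct count skips nulls.

```python
from collections import defaultdict

# Sample rows copied from df.show() above:
# (week, customer_id, vegetables_customers)
rows = [
    (1, 1, "users"),
    (1, 2, "not_users"),
    (1, 3, "users"),
    (2, 1, "not_users"),
    (2, 2, "not_users"),
    (2, 3, "users"),
]


def conditional_distinct_counts(rows):
    """Plain-Python model of the groupBy/agg (hypothetical helper, not a Spark API)."""
    total = defaultdict(set)  # week -> distinct customer_id values
    veg = defaultdict(set)    # week -> distinct customer_id values flagged 'users'
    for week, customer_id, flag in rows:
        total[week].add(customer_id)
        # Models when(cond, customer_id) with no otherwise():
        # rows that fail the condition become None (null)
        masked = customer_id if flag == "users" else None
        # Models the distinct count skipping nulls
        if masked is not None:
            veg[week].add(masked)
    return {week: (len(total[week]), len(veg[week])) for week in sorted(total)}


result = conditional_distinct_counts(rows)
print(result)  # {1: (3, 2), 2: (3, 1)}
```

The pairs match the count_total_users and count_vegetable_users columns of df2.show() for weeks 1 and 2.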
