Using f.when together with a count in PySpark output

Question

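I want to get the total number of registered users and the total number of identified users per month from two tables that I join. This is the desired output: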

Month  reg_users  iden_users
Jan       300        600
Feb       250        500
Mar       100        200

But I get an error:

when() missing 1 required positional argument: 'value'

The code I used:

#registered vs identified
dim_customers = (spark.table(f'nn_squad7_{country}.dim_customers')
                 .filter(f.col('registration_date').between(start,end))
                 .withColumn('month', f.date_format(f.date_sub(f.col('registration_date'), 1), 'MMM'))
                 .selectExpr('customer_id','age','gender','registration_date','month','1 as registered')
                )

df = (
      spark.table(f'nn_squad7_{country}.fact_table')
     .filter(f.col('date_key').between(start,end))
     .filter(f.col('is_client_plus')==1)
     .filter(f.col('source')=='tickets')
     .filter(f.col('subtype')=='trx')
     .filter(f.col('is_trx_ok') == 1) 
     .withColumn('week', f.date_format(f.date_sub(f.col('date_key'), 1), 'YYYY-ww'))
     .withColumn('month', f.date_format(f.date_sub(f.col('date_key'), 1), 'MMM'))
     .selectExpr('customer_id','1 as identified','date_key')
     )

output2 = (dim_customers
          .join(df,'customer_id','left')
          .fillna(0, subset=['identified'])
          .withColumn('month', f.date_format(f.date_sub(f.col('date_key'), 1), 'MMM'))
          .groupby('month')
          .agg(f.countDistinct('customer_id').alias('reg_users'),
               )
          .withColumn('iden_users',f.when((f.col('identified')==1)))
          )

display(output2)

Any idea why I am getting this error? Would the solution need 2 queries? My idea is to join the tables and do everything together in a single query.

Answer 1

I guess you want the distinct count of customer IDs that have identified = 1. You can do a conditional count by using when inside the aggregation:

output2 = (dim_customers
          .join(df,'customer_id','left')
          .fillna(0, subset=['identified'])
          .withColumn('month', f.date_format(f.date_sub(f.col('date_key'), 1), 'MMM'))
          .groupby('month')
          .agg(f.countDistinct('customer_id').alias('reg_users'),
               f.countDistinct(
                   f.when(
                       (f.col('identified')==1),
                       f.col('customer_id')
                   )
               ).alias('iden_users')
           )
          )
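As for the error itself: f.when takes two arguments, a condition and the value to return when the condition holds, and the question's code passes only the condition. In addition, the identified column no longer exists after groupby(...).agg(...), so the conditional has to live inside the aggregation. The reason the pattern above works is that when without otherwise yields null for non-matching rows, and countDistinct does not count nulls. Below is a minimal sketch of that logic in plain Python, not PySpark; the sample rows and the helper names when, count_distinct and aggregate are made up for illustration and only mimic the Spark behaviour.

```python
# Plain-Python model of the Spark semantics (illustrative only, not PySpark code).
# Hypothetical rows after the left join and fillna: (month, customer_id, identified).
rows = [
    ("Jan", 1, 1),
    ("Jan", 1, 1),  # same customer, second transaction
    ("Jan", 2, 0),  # registered but never identified
    ("Feb", 3, 1),
    ("Feb", 4, 0),
]


def when(condition, value):
    """Mimics f.when(condition, value) without .otherwise(): None if false."""
    return value if condition else None


def count_distinct(values):
    """Mimics f.countDistinct: nulls (None) are not counted."""
    return len({v for v in values if v is not None})


def aggregate(rows):
    """Group by month, then compute both counts in a single pass per group."""
    by_month = {}
    for month, customer_id, identified in rows:
        by_month.setdefault(month, []).append((customer_id, identified))
    return {
        month: {
            "reg_users": count_distinct(cid for cid, _ in group),
            "iden_users": count_distinct(
                when(ident == 1, cid) for cid, ident in group
            ),
        }
        for month, group in by_month.items()
    }


result = aggregate(rows)
print(result)
```

Customer 2 and customer 4 are mapped to None by when, so they contribute to reg_users but not to iden_users, which is why a single query is enough.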

Solution 1
1 Accepted 2021-02-03 12:15:35