
count rows in Dataframe Pyspark

I want to run some checks on my DF. To try this out, I'm using the following code:

import datetime
import pyspark.sql.functions as f

start = '2020-12-10'
end = datetime.date.today()
country='gb'


df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
      .filter(f.col('date_key').between(start,end))
      #.filter(f.col('is_client')==1)
      .filter(f.col('source')=='tickets')
      .filter(f.col('subtype')=='trx')
      .filter(f.col('is_trx_ok') == 1) 
      .select('ticket_id').distinct() 
      )

output = df_ua.count('ticket_id').distinct()

I'm getting the following error:

TypeError: count() takes 1 positional argument but 2 were given

I don't understand why I'm getting this error. Any clue?

Just doing df_ua.count() is enough, because you have already selected the distinct ticket_id values in the lines above.

df.count() returns the number of rows in the DataFrame. It does not take any parameters, such as column names. It also returns a plain integer, so you can't call distinct() on the result.

Maybe you can try this instead:

import datetime
import pyspark.sql.functions as f

start = '2020-12-10'
end = datetime.date.today()
country = 'gb'


df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
      .filter(f.col('date_key').between(start, end))
      #.filter(f.col('is_client')==1)
      .filter(f.col('source')=='tickets')
      .filter(f.col('subtype')=='trx')
      .filter(f.col('is_trx_ok') == 1) 
      .select('ticket_id').distinct() 
      )

output = df_ua.count()
