
count rows in Dataframe Pyspark

I want to run some checks on my DF. To try this out, I'm using the following code:

import datetime
import pyspark.sql.functions as f

start = '2020-12-10'
end = datetime.date.today()
country='gb'


df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
      .filter(f.col('date_key').between(start,end))
      #.filter(f.col('is_client')==1)
      .filter(f.col('source')=='tickets')
      .filter(f.col('subtype')=='trx')
      .filter(f.col('is_trx_ok') == 1) 
      .select('ticket_id').distinct() 
      )

output = df_ua.count('ticket_id').distinct()

I'm getting the following error:

TypeError: count() takes 1 positional argument but 2 were given

I don't understand why I'm getting this error. Any clue?

Just doing df_ua.count() is enough, because you have already selected the distinct ticket_id values in the lines above.

df.count() returns the number of rows in the DataFrame. It does not take any parameters, such as column names. It also returns a plain integer, so you can't call distinct() on the result.

Maybe you can try this instead:

import datetime
import pyspark.sql.functions as f

start = '2020-12-10'
end = datetime.date.today()
country = 'gb'


df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
      .filter(f.col('date_key').between(start, end))
      #.filter(f.col('is_client')==1)
      .filter(f.col('source')=='tickets')
      .filter(f.col('subtype')=='trx')
      .filter(f.col('is_trx_ok') == 1) 
      .select('ticket_id').distinct() 
      )

output = df_ua.count()
