Count rows in Dataframe Pyspark
I want to make some checks on my DF. In order to try it, I'm using the following code:
import datetime
import pyspark.sql.functions as f

start = '2020-12-10'
end = datetime.date.today()
country='gb'
df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
.filter(f.col('date_key').between(start,end))
#.filter(f.col('is_client')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='trx')
.filter(f.col('is_trx_ok') == 1)
.select('ticket_id').distinct()
)
output = df_ua.count('ticket_id').distinct()
I'm getting the following error:
TypeError: count() takes 1 positional argument but 2 were given
I don't understand why I'm getting it. Any clue?
Just doing df_ua.count()
is enough, because you have selected distinct ticket_id
in the lines above.
df.count()
returns the number of rows in the dataframe. It does not take any parameters, such as column names. Also, it returns an integer, so you can't call distinct
on an integer.
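The "takes 1 positional argument but 2 were given" part can look confusing because only one argument was passed. The reason is that Python counts the implicit `self` as the first positional argument of an instance method. A minimal plain-Python sketch (the `MiniDF` class is a made-up stand-in, not the real PySpark DataFrame):

```python
class MiniDF:
    """Toy stand-in for a DataFrame whose count() takes no arguments."""

    def count(self):
        # Like DataFrame.count(): self is the only positional argument.
        return 42

df = MiniDF()
result = df.count()  # fine: only self is passed implicitly

try:
    df.count("ticket_id")  # mirrors df_ua.count('ticket_id') in the question
except TypeError as e:
    # Python counts self, so self + "ticket_id" = 2 positional arguments
    # against a method that accepts only 1.
    print(e)
```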
Maybe you can try this instead:
import datetime
import pyspark.sql.functions as f
start = '2020-12-10'
end = datetime.date.today()
country = 'gb'
df_ua = (spark.table(f'nn_squad7_{country}.fact_table')
.filter(f.col('date_key').between(start, end))
#.filter(f.col('is_client')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='trx')
.filter(f.col('is_trx_ok') == 1)
.select('ticket_id').distinct()
)
output = df_ua.count()
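The underlying pattern here is distinct-then-count: deduplicate first, then count the survivors. A plain-Python sketch of the same logic, without a Spark session (the sample `ticket_ids` list is made up for illustration):

```python
# Made-up sample data standing in for the ticket_id column.
ticket_ids = ["t1", "t2", "t2", "t3", "t1"]

# Deduplicate first: analogous to .select('ticket_id').distinct().
distinct_tickets = set(ticket_ids)

# Count second: analogous to df_ua.count(), which takes no arguments
# and returns a plain integer.
output = len(distinct_tickets)
print(output)  # → 3
```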