Spark SQL lazy count
I need to use a dataframe count as the divisor for calculating percentages.
This is what I'm doing:
scala> import org.apache.spark.sql.functions.{count, lit}
scala> val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")
scala> val overallCount = df.count
scala> df.groupBy("value").agg(count(lit(1)) / overallCount)
But I would like to avoid the action df.count, as it will be evaluated immediately.
Accumulators won't help, as they would be evaluated in advance.
Is there a way to perform a lazy count over a dataframe?
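To make the eagerness concrete: Dataset.count is an action, so it runs a Spark job the moment it is called, while an aggregation built with select is a transformation that only records a plan. A minimal spark-shell sketch of the difference (assuming the df and import above):

scala> val eagerCount = df.count              // action: triggers a job immediately, returns a Long
scala> val lazyCount = df.select(count($"*")) // transformation: returns a DataFrame, no job yet
scala> lazyCount.show()                       // evaluation happens only at this action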
Instead of using Dataset.count, you can use a simple query:
val overallCount = df.select(count($"*") as "overallCount")
and later crossJoin:
df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)
  .select($"value", $"groupCount" / $"overallCount")