Spark SQL lazy count
I need to use a dataframe count as the divisor for calculating percentages.
This is what I'm doing:
scala> import org.apache.spark.sql.functions.{count, lit}
scala> val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")
scala> val overallCount = df.count
scala> df.groupBy("value").agg(count(lit(1)) / overallCount)
But I would like to avoid the action df.count, as it will be evaluated immediately.
Accumulators won't help, as they would be evaluated in advance.
Is there a way to perform a lazy count over a dataframe?
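To make the eagerness concrete: Dataset.count is an action, so it runs a Spark job the moment it is called, while an aggregation built with select is a transformation that only records a plan. A minimal spark-shell sketch of the difference (assuming the df and import above):

scala> val eagerCount = df.count              // action: triggers a job immediately, returns a Long
scala> val lazyCount = df.select(count($"*")) // transformation: returns a DataFrame, no job yet
scala> lazyCount.show()                       // evaluation happens only at this action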
Instead of using Dataset.count, you can use a simple query:
val overallCount = df.select(count($"*") as "overallCount")
and later crossJoin:
df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)
  .select($"value", $"groupCount" / $"overallCount")