
Spark SQL lazy count

I need to use a DataFrame count as the divisor for calculating percentages.

This is what I'm doing:

scala> import org.apache.spark.sql.functions._
scala> val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")
scala> val overallCount = df.count
scala> df.groupBy("value").agg(count(lit(1)) / overallCount)

But I would like to avoid the df.count action, as it is evaluated immediately and triggers a Spark job on its own.

Accumulators won't help either, as they would have to be evaluated in advance.

Is there a way to perform a lazy count over a dataframe?

Instead of using Dataset.count you can use a simple query

val overallCount = df.select(count($"*") as "overallCount")

and then crossJoin it with the grouped result:

df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)
  .select($"value", $"groupCount" / $"overallCount")
