
Expand expression in Spark Scala aggregation

I'm trying to convert some simple aggregation code from PySpark to Scala.

The dataframes:

# PySpark
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([10, 100],),
     ([20, 200],)],
    ['vals'])
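// Scala
// (these two imports are only needed outside spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
    Seq(10, 100),
    Seq(20, 200)
).toDF("vals")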

Aggregation expansion - OK in PySpark:

df2 = df.agg(
    *[F.sum(F.col("vals")[i]).alias(f"col{i}") for i in range(2)]
)
df2.show()
# +----+----+
# |col0|col1|
# +----+----+
# |  30| 300|
# +----+----+

But in Scala...

val df2 = df.agg(
  (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)
  (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
                                                       ^
On line 2: error: no `: _*` annotation allowed here (such annotations are only allowed in arguments to *-parameters)

The syntax is almost the same as in this select, which works fine:

val df2 = df.select(
  (0 until 2).map(i => $"vals"(i).alias(s"col$i")): _*
)

Does expression expansion work in Scala Spark aggregations? How?

I don't fully understand why the compiler reports it this way, but it does not unpack your Seq[Column] into varargs here, because the first parameter of agg is not a varargs parameter.

As @RvdV mentioned in his answer, the signature of the method is def agg(expr: Column, exprs: Column*): DataFrame

So a temporary workaround is to unpack the sequence manually, like this:

val seq = Seq(0, 1).map(i => sum($"vals"(i)).alias(s"col$i"))
val df2 = df.agg(seq(0), seq(1))
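
To see that the error comes from the shape of the method signature rather than from Spark itself, here is a minimal plain-Scala sketch. The `select` and `agg` functions below are hypothetical stand-ins that only copy the parameter shapes of Dataset.select(cols: Column*) and Dataset.agg(expr: Column, exprs: Column*), with String used in place of Column so that it runs without Spark.

```scala
// Hypothetical stand-ins: they only mirror the parameter shapes of
// Dataset.select(cols: Column*) and Dataset.agg(expr: Column, exprs: Column*).
object VarargsShape {
  def select(cols: String*): Seq[String] = cols.toList
  def agg(expr: String, exprs: String*): Seq[String] = expr +: exprs.toList

  val cols: Seq[String] = (0 until 2).map(i => s"col$i")

  // OK: the only parameter is a *-parameter, so the whole sequence can be expanded.
  val selected: Seq[String] = select(cols: _*)

  // agg(cols: _*) does not compile: the first argument position belongs to the
  // plain parameter `expr`, and `: _*` is only allowed for a *-parameter.

  // OK: fill `expr` explicitly, then expand the remainder into `exprs`.
  val aggregated: Seq[String] = agg(cols.head, cols.tail: _*)
}
```

Both calls end up with the same two elements; the only difference is that `agg` needs its first argument passed separately.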

If you look at the documentation of Dataset.agg, you see that it takes one fixed parameter followed by a variable-length parameter:

def agg(expr: Column, exprs: Column*): DataFrame

So you have to pass one aggregation as the first argument, and only then can you expand a list into the second parameter. Something like:

val df2 = df.agg(
  first($"vals"), (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)

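or any other single aggregation in front of the list should work. Note that the leading aggregation also appears as an extra column in the result. I don't know for sure why the signature looks like this; it is probably not a Scala limitation but a way to make sure that at least one aggregation is always passed, which a lone Column* parameter could not enforce.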
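
If you do not want an extra column in the output, one way to satisfy the same signature is to split the sequence itself into head and tail. The following is a sketch that assumes a local SparkSession (the master setting and app name are arbitrary choices, not part of the question); in spark-shell the `spark` value and the implicits already exist, so only the last lines are needed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.sum

// Assumption: a throwaway local session; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("agg-expansion")
  .getOrCreate()
import spark.implicits._

val df = Seq(Seq(10, 100), Seq(20, 200)).toDF("vals")

// Build the aggregations once, as in the question.
val aggs: Seq[Column] = (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i"))

// head fills the fixed parameter `expr`; tail is expanded into `exprs`.
val df2 = df.agg(aggs.head, aggs.tail: _*)
df2.show()
// +----+----+
// |col0|col1|
// +----+----+
// |  30| 300|
// +----+----+
```

This only works for a non-empty sequence, since `head` throws on an empty one, so a generated list of aggregations may need a check before the call.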
