
Spark Dataset Join and Aggregate columns

I have three Spark Datasets of the same type A:

case class A(col_a: String, col_b: Int, col_c: Int, col_d: Int, col_e: Int)

val ds_one = Dataset[A](A("a", 12, 0, 0, 0), A("b", 11, 0, 0, 0))
val ds_two = Dataset[A](A("a", 0, 16, 0, 0), A("b", 0, 73, 0, 0))
val ds_three = Dataset[A](A("a", 0, 0, 9, 0), A("b", 0, 0, 64, 0))

How can I collapse the three Datasets into a single Dataset[A]:

ds_combined = Dataset[A](A("a", 12, 16, 9, 0), A("b", 11, 73, 64, 0))

It looks like you just need to group by col_a and take the max of each column; since the other columns are 0 in each input Dataset, max picks up the value that was actually populated.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDS when not in the spark-shell

case class A(col_a: String, col_b: Int, col_c: Int, col_d: Int, col_e: Int)

val ds_one = Seq(A("a", 12, 0, 0, 0), A("b", 11, 0, 0, 0)).toDS
val ds_two = Seq(A("a", 0, 16, 0, 0), A("b", 0, 73, 0, 0)).toDS
val ds_three = Seq(A("a", 0, 0, 9, 0), A("b", 0, 0, 64, 0)).toDS

// stack the three Datasets, then take the per-column max within each col_a group
val ds_union = ds_one.union(ds_two).union(ds_three)
val ds_combined = ds_union
  .groupBy("col_a")
  .agg(
    max("col_b").alias("col_b"),
    max("col_c").alias("col_c"),
    max("col_d").alias("col_d"),
    max("col_e").alias("col_e"))
  .as[A]



ds_combined.show

ds_combined: org.apache.spark.sql.Dataset[A]

+-----+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+
|    b|   11|   73|   64|    0|
|    a|   12|   16|    9|    0|
+-----+-----+-----+-----+-----+
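
If you prefer to stay in the typed API end to end, a minimal sketch (assuming the same case class A, ds_union and spark.implicits._ from above) is to group with groupByKey and reduce each group with a per-field max:

// Typed alternative: reduce each col_a group field by field.
val ds_typed = ds_union
  .groupByKey(_.col_a)
  .reduceGroups { (x, y) =>
    A(x.col_a,
      math.max(x.col_b, y.col_b),
      math.max(x.col_c, y.col_c),
      math.max(x.col_d, y.col_d),
      math.max(x.col_e, y.col_e))
  }
  .map(_._2) // drop the grouping key, keep the reduced A

This keeps the result a Dataset[A] without passing through an untyped DataFrame, at the cost of spelling out the reduction by hand; the groupBy/max version above is usually simpler and lets Catalyst optimize the aggregation.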
