Spark Scala: Aggregate DataFrame Column Values into an Ordered List
I have a Spark Scala DataFrame with four columns: (id, day, val, order). I want to create a new DataFrame with columns (id, day, value_list: List(val1, val2, ..., valn)), where val1 through valn are ordered by ascending order value.
For instance:
(50, 113, 1, 1),
(50, 113, 1, 3),
(50, 113, 2, 2),
(51, 114, 1, 2),
(51, 114, 2, 1),
(51, 113, 1, 1)
would become:
((51,113),List(1))
((51,114),List(2, 1))
((50,113),List(1, 2, 1))
I'm close, but don't know what to do after I've aggregated the data into a list. I'm not sure how to then have Spark order each value list by the order int:
import org.apache.spark.sql.Row
import sqlContext.implicits._  // needed for toDF

val testList = List((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1))
val testDF = sqlContext.sparkContext.parallelize(testList).toDF("id1", "id2", "val", "order")
// key each row by (id1, id2) and wrap the (val, order) pair in a one-element list
val rDD1 = testDF.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
// concatenate the per-row lists for each key
val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
where the output looks like:
((51,113),List((1,1)))
((51,114),List((1,2), (2,1)))
((50,113),List((1,3), (1,1), (2,2)))
The next step would be to produce:
((51,113),List((1,1)))
((51,114),List((2,1), (1,2)))
((50,113),List((1,1), (2,2), (1,3)))
You will just need to map over your RDD and use sortBy:
scala> val df = Seq((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1)).toDF("id1", "id2", "val", "order")
df: org.apache.spark.sql.DataFrame = [id1: int, id2: int, val: int, order: int]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rDD1 = df.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
rDD1: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[10] at map at <console>:28
scala> val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
rDD2: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = ShuffledRDD[11] at reduceByKey at <console>:30
scala> val rDD3 = rDD2.map(x => (x._1, x._2.sortBy(_._2)))
rDD3: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[12] at map at <console>:32
scala> rDD3.collect.foreach(println)
((51,113),List((1,1)))
((50,113),List((1,1), (2,2), (1,3)))
((51,114),List((2,1), (1,2)))
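To finish, the order ints can be dropped from each sorted list so only the values remain, matching the output format asked for in the question. A minimal sketch of that last map (the name rDD4 is just illustrative):

// keep only val from each (val, order) pair, now that each list is sorted
val rDD4 = rDD3.map { case (key, list) => (key, list.map(_._1)) }

rDD4.collect.foreach(println) then prints, for example, ((50,113),List(1, 2, 1)).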
An alternative with the DataFrame API uses collect_list:

import org.apache.spark.sql.functions.collect_list

testDF.groupBy("id1","id2").agg(collect_list($"val")).show
+---+---+-----------------+
|id1|id2|collect_list(val)|
+---+---+-----------------+
| 51|113| [1]|
| 51|114| [1, 2]|
| 50|113| [1, 1, 2]|
+---+---+-----------------+
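Note that collect_list on its own gives no ordering guarantee, and the [1, 1, 2] row above is not sorted by the order column. One way to get the ordered lists purely with the DataFrame API is to collect (order, val) structs, sort each array, and project out the values. This is a sketch that assumes Spark 2.x or later (the pairs.val field extraction on an array of structs is not available in older versions):

import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

// structs compare field by field, so sorting the array orders each list by `order`
val ordered = testDF
  .groupBy("id1", "id2")
  .agg(sort_array(collect_list(struct($"order", $"val"))).as("pairs"))
  // extract the val field from every struct in the sorted array
  .select($"id1", $"id2", $"pairs.val".as("value_list"))

ordered.show()

For (50, 113) this should yield [1, 2, 1], matching the desired ordering.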