

Make RDD from List in Scala & Spark

Original data

A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5

To build a list of rows per ID group, I did the following:

val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }

My goal is to produce the RDD below. The purpose is to find, within each ID group, the NAME of the previous row (SEQ-1) and the difference in NUMBER from that row.

A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2

Currently ListRDD looks like this:

A, ([A,John,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)

This is the code I tried in order to build the target RDD from ListRDD (it does not work as I want):

  def myFunction(ListRDD: RDD[(String, List[Row])]) = {
    var rows: List[Row] = Nil
    ListRDD.foreach( row => {
        rows ::: make(row._2)
    })
    // rows is still empty here, and it is not an RDD
  }

  def make(eachList: List[Row]): List[Row] = {
      eachList.foreach { x => //... Make PRE_NAME and DIFF in new List
      }
  }

My final goal is to save this RDD as CSV (RDD.saveAsFile...). How can I build this RDD (not a List) from this data?

Window functions look like a good fit here:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = sc.parallelize(Seq(
    ("A", "John", 1, 3),
    ("A", "Bob", 2, 5),
    ("A", "Sam", 3, 1),
    ("B", "Kim", 1, 4),
    ("B", "John", 2, 3),
    ("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")

val w = Window.partitionBy($"ID").orderBy($"SEQ")

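// Keep all original columns and add the previous row's NAME and the NUMBER difference
df.select(
  $"*",
  lag($"NAME", 1).over(w).alias("PREV_NAME"),
  ($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))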
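
If you would rather stay with the grouped-list approach from the question, the per-group step can be sketched on plain Scala collections. This is only a rough illustration, not the asker's actual code: `Rec`, `Out`, `PrevRowSketch` and `toCsv` are hypothetical names, a case class stands in for `Row`, and writing the literal text NULL for missing values is an assumption.

```scala
// Hypothetical record type standing in for the question's Row values.
final case class Rec(id: String, name: String, seq: Int, number: Int)

// Output row: the original fields plus the previous NAME and the NUMBER
// difference; both are None for the first row of an ID group.
final case class Out(id: String, name: String, seq: Int, number: Int,
                     prevName: Option[String], diff: Option[Int])

object PrevRowSketch {
  // Sort one ID group by SEQ and pair every row with its predecessor.
  def make(group: Seq[Rec]): Seq[Out] = {
    val sorted = group.sortBy(_.seq)
    // None for the first row, Some(previous row) for all later rows.
    val previous: Seq[Option[Rec]] =
      None +: sorted.dropRight(1).map(r => Option(r))
    sorted.zip(previous).map { case (cur, prev) =>
      Out(cur.id, cur.name, cur.seq, cur.number,
          prev.map(_.name), prev.map(p => cur.number - p.number))
    }
  }

  // Render one output row as a CSV line (assumption: NULL marks a missing value).
  def toCsv(o: Out): String =
    Seq(o.id, o.name, o.seq.toString, o.number.toString,
        o.prevName.getOrElse("NULL"),
        o.diff.map(_.toString).getOrElse("NULL")).mkString(",")

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Rec("A", "John", 1, 3), Rec("A", "Bob", 2, 5), Rec("A", "Sam", 3, 1),
      Rec("B", "Kim", 1, 4), Rec("B", "John", 2, 3), Rec("B", "Ria", 3, 5))
    // Local stand-in for the grouping step; groups are sorted by ID only to
    // make the printed order deterministic.
    val result = data.groupBy(_.id).toSeq.sortBy(_._1)
      .flatMap { case (_, rows) => make(rows) }
    result.map(toCsv).foreach(println)
  }
}
```

On the real data the same function would be applied with a transformation instead of `foreach`, for example `ListRDD.flatMap { case (_, rows) => make(rows) }`, which returns a new RDD. The attempt in the question leaves `rows` empty for two reasons: `:::` returns a new list rather than modifying `rows`, and `foreach` runs on the executors, so a driver-side `var` is not updated. Mapping the resulting RDD through something like `toCsv` and calling `saveAsTextFile` with an output path of your choice is one way to write the lines out as text.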

