
In Scala, how do you join 2 RDDs?

If I have 2 RDDs defined as:

Sample(Key1,EventDate,Value1) 
Sample2(Key1,ExecutionDate, Label1) 

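I want to join the two RDDs to determine whether Key1 exists in Sample2, and then split the complete result into 2 new RDDs: one containing the records whose Key1 exists in Sample2, and another containing all records whose Key1 does not exist in Sample2.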

FoundKey1(Key1, EventDate,Value1) 
NotFoundKey1(Key1, ExecutionDate,Label1)

Essentially, I want something similar to what I would do in SQL:

 Select Sample.Key1, Sample.EventDate, Sample.Value1
 from Sample
 where NOT EXISTS (select 1 from Sample2 where Sample2.Key1 = Sample.Key1) 

And for the other table:

 SELECT Sample.Key1, Sample.EventDate, Sample.Value1
 from Sample right join Sample2
 on (Sample.Key1 = Sample2.Key1);

Sample RDD values:

  Sample(1, 2016-01-05, 10)
  Sample(1, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(3, 2016-01-05, 10)

  Sample2(1, 2016-01-05, A)
  Sample2(3, 2016-01-05, A)
  Sample2(5, 2016-01-05, B)
  Sample2(6, 2016-01-05, C)
  Sample2(7, 2016-01-05, C)
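
To make the expected outcome concrete, here is a minimal sketch in plain Scala collections (no Spark) that applies the two conditions to the sample values above. The case classes `SampleRow` and `Sample2Row`, their field names, and the object name `ExpectedSplit` are assumptions made for illustration only, and dates are kept as strings.

```scala
object ExpectedSplit {
  // Hypothetical row types mirroring Sample(Key1, EventDate, Value1)
  // and Sample2(Key1, ExecutionDate, Label1); the names are illustrative.
  final case class SampleRow(key1: Int, eventDate: String, value1: Int)
  final case class Sample2Row(key1: Int, executionDate: String, label1: String)

  val sample: Seq[SampleRow] = Seq(
    SampleRow(1, "2016-01-05", 10),
    SampleRow(1, "2016-01-05", 10),
    SampleRow(2, "2016-01-05", 10),
    SampleRow(2, "2016-01-05", 10),
    SampleRow(3, "2016-01-05", 10)
  )

  val sample2: Seq[Sample2Row] = Seq(
    Sample2Row(1, "2016-01-05", "A"),
    Sample2Row(3, "2016-01-05", "A"),
    Sample2Row(5, "2016-01-05", "B"),
    Sample2Row(6, "2016-01-05", "C"),
    Sample2Row(7, "2016-01-05", "C")
  )

  // Distinct keys that occur in Sample2.
  val keysInSample2: Set[Int] = sample2.map(_.key1).toSet

  // partition splits on the predicate and keeps the input order:
  // the first part is the EXISTS side, the second the NOT EXISTS side.
  val (found, notFound) = sample.partition(row => keysInSample2.contains(row.key1))

  def main(args: Array[String]): Unit = {
    println(found.map(_.key1))    // List(1, 1, 3)
    println(notFound.map(_.key1)) // List(2, 2)
  }
}
```

With this data, the rows with Key1 of 1 and 3 land in the "found" set and the rows with Key1 of 2 land in the "not found" set; keys 5, 6 and 7 occur only in Sample2 and so appear in neither.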

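Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class holding the values for Sample and TestData2 holds the values for Sample2, that is (Key1, EventDate, Value) and (Key1, ExecutionDate, Label) respectively.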

This is what I have tried so far:

  val grpSample = sample.groupBy(_.Key1).map(_._2)
  val grpSample2 = sample2.groupBy(_.Key1).map(_._2)
  val interSect = grpSample.intersection(grpSample2)

I ran this code to see whether it would group the records, and I received an error message.

// Group each RDD by its key, giving RDD[(key, Iterable[record])]
val rdd1 = sample.groupBy(_.Key1)
val rdd2 = sample2.groupBy(_.Key1)

// Data for keys that exist in both RDDs
val result1 = rdd1.join(rdd2).map(_._2)

// Data for keys that exist in the first RDD but not in the second:
// after a full outer join, the right-hand side is None for those keys
val tempresult = rdd1.fullOuterJoin(rdd2)
val result2 = tempresult.filter(_._2._2.isEmpty).map(_._2._1.get)
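
As a related sketch, assuming the duplicates in Sample should be kept as individual records rather than grouped, both RDDs can be keyed with `keyBy` and combined with `cogroup`, which yields, for each key, the values from each side. The case classes, the `splitByKey` helper, and the local `SparkContext` setup below are assumptions for illustration, not part of the original post; Spark must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CogroupSplit {
  // Hypothetical row types; names are illustrative, dates kept as strings.
  final case class SampleRow(key1: Int, eventDate: String, value1: Int)
  final case class Sample2Row(key1: Int, executionDate: String, label1: String)

  // Returns (rows of `sample` whose key1 occurs in `sample2`,
  //          rows of `sample` whose key1 does not occur in `sample2`).
  def splitByKey(sample: RDD[SampleRow],
                 sample2: RDD[Sample2Row]): (RDD[SampleRow], RDD[SampleRow]) = {
    // One record per key: (key, (rows from sample, rows from sample2)).
    val grouped = sample.keyBy(_.key1).cogroup(sample2.keyBy(_.key1)).cache()

    val found = grouped
      .filter { case (_, (_, right)) => right.nonEmpty }
      .flatMap { case (_, (left, _)) => left }

    val notFound = grouped
      .filter { case (_, (_, right)) => right.isEmpty }
      .flatMap { case (_, (left, _)) => left }

    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cogroup-split").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = sc.parallelize(Seq(
        SampleRow(1, "2016-01-05", 10), SampleRow(1, "2016-01-05", 10),
        SampleRow(2, "2016-01-05", 10), SampleRow(2, "2016-01-05", 10),
        SampleRow(3, "2016-01-05", 10)))
      val sample2 = sc.parallelize(Seq(
        Sample2Row(1, "2016-01-05", "A"), Sample2Row(3, "2016-01-05", "A"),
        Sample2Row(5, "2016-01-05", "B"), Sample2Row(6, "2016-01-05", "C"),
        Sample2Row(7, "2016-01-05", "C")))

      val (found, notFound) = splitByKey(sample, sample2)
      println(found.map(_.key1).collect().sorted.mkString(","))    // 1,1,3
      println(notFound.map(_.key1).collect().sorted.mkString(",")) // 2,2
    } finally {
      sc.stop()
    }
  }
}
```

Unlike a plain `join` on ungrouped records, this does not multiply rows when a key is repeated on both sides. For the "not found" side alone, `subtractByKey` on the keyed RDDs is another option.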

