In Scala, how do you join two RDDs?
Suppose I have two RDDs defined as:
Sample(Key1,EventDate,Value1)
Sample2(Key1,ExecutionDate, Label1)
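I want to join the two RDDs to determine whether Key1 exists in Sample2, and then split the full result into two new RDDs: one containing the records whose Key1 exists in Sample2, and another containing all the records whose Key1 does not exist in Sample2.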
FoundKey1(Key1, EventDate,Value1)
NotFoundKey1(Key1, ExecutionDate,Label1)
Essentially, I want something similar to what I would do in SQL:
SELECT Sample.Key1, Sample.EventDate, Sample.Value1
from Sample
where NOT EXISTS (select 1 from Sample2 where Sample2.Key1 = Sample.Key1)
And for the other table:
SELECT Sample.Key1, Sample.EventDate, Sample.Value1
from Sample right join Sample2
on (Sample.Key1 = Sample2.Key1);
Sample RDD values:
Sample(1, 2016-01-05, 10)
Sample(1, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(3, 2016-01-05, 10)
Sample(1, 2016-01-05, A)
Sample(3, 2016-01-05, A)
Sample(5, 2016-01-05, B)
Sample(6, 2016-01-05, C)
Sample(7, 2016-01-05, C)
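To make the intended split concrete, here is a rough sketch on plain Scala collections rather than RDDs. It assumes that the five rows with numeric values belong to Sample and the five rows with letter labels belong to Sample2 (the listing above labels all of them Sample). The case classes and the SplitDemo object are hypothetical names introduced only for this illustration.

```scala
// Hypothetical record types mirroring the field lists in the question.
case class Sample(Key1: Int, EventDate: String, Value1: Int)
case class Sample2(Key1: Int, ExecutionDate: String, Label1: String)

object SplitDemo {
  val sample: Seq[Sample] = Seq(
    Sample(1, "2016-01-05", 10),
    Sample(1, "2016-01-05", 10),
    Sample(2, "2016-01-05", 10),
    Sample(2, "2016-01-05", 10),
    Sample(3, "2016-01-05", 10)
  )

  // Assumption: the rows with letter labels are the Sample2 rows.
  val sample2: Seq[Sample2] = Seq(
    Sample2(1, "2016-01-05", "A"),
    Sample2(3, "2016-01-05", "A"),
    Sample2(5, "2016-01-05", "B"),
    Sample2(6, "2016-01-05", "C"),
    Sample2(7, "2016-01-05", "C")
  )

  // Every key that occurs at least once in Sample2.
  val keysInSample2: Set[Int] = sample2.map(_.Key1).toSet

  // partition returns (rows satisfying the predicate, remaining rows).
  val (foundKey1, notFoundKey1) =
    sample.partition(s => keysInSample2.contains(s.Key1))
}
```

Under that assumption, foundKey1 holds the three rows with keys 1, 1 and 3, and notFoundKey1 holds the two rows with key 2.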
Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class. Sample and TestData2 hold the values (Key1, EventDate, Value) and (Key1, ExecutionDate, Label), respectively.
This is what I have tried so far:
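If each element really is an Iterable of records, a key selector such as _.Key1 is applied to the whole Iterable rather than to a single record, so flattening first may be needed. The following sketch shows the idea on plain Scala collections; the TestData fields and the FlattenGroups object are assumptions made for illustration, not the actual class from the question.

```scala
// Hypothetical stand-in for the question's TestData class.
case class TestData(Key1: Int, EventDate: String, Value: String)

object FlattenGroups {
  // Nested shape, analogous to RDD[Iterable[TestData]].
  val nested: Seq[Iterable[TestData]] = Seq(
    List(TestData(1, "2016-01-05", "10"), TestData(1, "2016-01-05", "10")),
    List(TestData(2, "2016-01-05", "10"))
  )

  // One element per record. On an RDD the analogous step
  // would be rdd.flatMap(x => x).
  val flat: Seq[TestData] = nested.flatten

  // Grouping by key now sees individual records.
  val byKey: Map[Int, Seq[TestData]] = flat.groupBy(_.Key1)
}
```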
val grpSample = sample.groupBy(_.Key1).map(_._2)
val grpSample2 = sample2.groupBy(_.Key1).map(_._2)
val interSect = grpSample.intersection(grpSample2)
I ran this code to check whether the grouping worked, and I got an error message.
val rdd1 = sample.groupBy(_.Key1)
val rdd2 = sample2.groupBy(_.Key1)
// Data for keys that exist in both RDDs
val result1 = rdd1.join(rdd2).map(_._2)
// Data for keys that exist in the first RDD but not in the second
val tempresult = rdd1.fullOuterJoin(rdd2)
val result2 = tempresult.filter(_._2._2.isEmpty).map(_._2._1.get)
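An alternative that avoids grouping and the full outer join is sketched below. It keys each Sample row with keyBy, reduces Sample2 to its distinct keys, and then uses join for the found rows and subtractByKey for the rest. The case classes, the SplitByKey object, the split helper and the local master setting are all assumptions made for this sketch. Unlike the code above, it returns individual Sample rows rather than grouped Iterables.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record types mirroring the field lists in the question.
case class Sample(Key1: Int, EventDate: String, Value1: Int)
case class Sample2(Key1: Int, ExecutionDate: String, Label1: String)

object SplitByKey {
  // Returns (rows of `sample` whose Key1 occurs in `sample2`,
  //          rows of `sample` whose Key1 does not occur there).
  def split(sample: RDD[Sample], sample2: RDD[Sample2]): (RDD[Sample], RDD[Sample]) = {
    val left: RDD[(Int, Sample)] = sample.keyBy(_.Key1)
    // Distinct keys only, so the join cannot duplicate Sample rows.
    val rightKeys: RDD[(Int, Unit)] = sample2.map(r => (r.Key1, ())).distinct()

    val found = left.join(rightKeys).map { case (_, (row, _)) => row }
    val notFound = left.subtractByKey(rightKeys).values
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for trying the sketch out on one machine.
    val conf = new SparkConf().setAppName("split-by-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = sc.parallelize(Seq(
        Sample(1, "2016-01-05", 10),
        Sample(2, "2016-01-05", 10),
        Sample(3, "2016-01-05", 10)))
      val sample2 = sc.parallelize(Seq(
        Sample2(1, "2016-01-05", "A"),
        Sample2(3, "2016-01-05", "A"),
        Sample2(5, "2016-01-05", "B")))

      val (found, notFound) = split(sample, sample2)
      found.collect().foreach(println)
      notFound.collect().foreach(println)
    } finally sc.stop()
  }
}
```

Since both outputs read the keyed Sample RDD, caching it before the two operations may be worthwhile on larger data. A leftOuterJoin followed by a filter on the Option side is another way to get both halves from a single join.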