How do you join 2 RDDs in Scala?

Question

If I have 2 RDDs defined as:

Sample(Key1,EventDate,Value1) 
Sample2(Key1,ExecutionDate, Label1)

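I want to join the two RDDs to determine whether Key1 exists in Sample2, and then split the full result into 2 new RDDs: one containing the records whose Key1 exists in Sample2, and the other containing all records whose Key1 does not exist in Sample2.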

FoundKey1(Key1, EventDate,Value1) 
NotFoundKey1(Key1, ExecutionDate,Label1)

Essentially, I want something similar to what I would do in SQL:

 Select Sample.Key1, Sample.EventDate, Sample.Value1
 from Sample
 where NOT EXISTS (select 1 from Sample2 where Sample2.Key1 = Sample.Key1)

And for the other table:

 SELECT Sample.Key1, Sample.EventDate, Sample.Value1
 from Sample right join Sample2
 on (Sample.Key1 = Sample2.Key1);

Sample RDD values:

  Sample(1, 2016-01-05, 10)
  Sample(1, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(3, 2016-01-05, 10)

  Sample2(1, 2016-01-05, A)
  Sample2(3, 2016-01-05, A)
  Sample2(5, 2016-01-05, B)
  Sample2(6, 2016-01-05, C)
  Sample2(7, 2016-01-05, C)

Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class. The fields of Sample (TestData) and Sample2 (TestData2) are (Key1, EventDate, Value) and (Key1, ExecutionDate, Label), respectively.

Here is what I have tried so far:

  val grpSample.groupBy(_.Key1).map(_._2)
  val grpSample2.groupBy(_.Key2).map(_._2)
  val interSect = grpSample.intersection.grpSample2

I ran this code to check whether the grouping works, and I got an error message.

Answer 1

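// Key both RDDs by Key1 so that they can be joined
val rdd1 = sample.groupBy(_.Key1)
val rdd2 = sample2.groupBy(_.Key1)

// Data for keys that exist in both RDDs
val result1 = rdd1.join(rdd2).map(_._2)

// Data for keys that exist in the first RDD but not in the second.
// fullOuterJoin yields (key, (Option[left], Option[right])); an empty right
// side means the key is missing from rdd2, so the left side is always defined.
val tempresult = rdd1.fullOuterJoin(rdd2)
val result2 = tempresult.filter(_._2._2.isEmpty).map(_._2._1.get)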
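
For reference, here is a minimal self-contained sketch of the same split, run on the sample values from the question. It assumes flat RDD[Sample] and RDD[Sample2] inputs; the case classes, their lowercase field names, and the SplitByKey object are hypothetical names chosen for illustration. Instead of fullOuterJoin plus a filter, it uses subtractByKey for the NOT EXISTS side, which avoids building Option pairs. If the inputs are really RDD[Iterable[TestData]], they would need to be flattened first, for example with flatMap(x => x).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record types mirroring the question's Sample and Sample2.
case class Sample(key1: Int, eventDate: String, value1: Int)
case class Sample2(key1: Int, executionDate: String, label1: String)

object SplitByKey {
  // Returns (found, notFound): the Sample rows whose key does / does not occur in Sample2.
  def split(sample: RDD[Sample], sample2: RDD[Sample2]): (RDD[Sample], RDD[Sample]) = {
    val keyed: RDD[(Int, Sample)] = sample.keyBy(_.key1)
    // Distinct Sample2 keys only, so the join cannot duplicate Sample rows.
    val keys2: RDD[(Int, Unit)] = sample2.map(s => (s.key1, ())).distinct()

    val found = keyed.join(keys2).map { case (_, (s, _)) => s } // like WHERE EXISTS
    val notFound = keyed.subtractByKey(keys2).values            // like WHERE NOT EXISTS
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying the sketch out on one machine.
    val conf = new SparkConf().setAppName("split-by-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = sc.parallelize(Seq(
        Sample(1, "2016-01-05", 10), Sample(1, "2016-01-05", 10),
        Sample(2, "2016-01-05", 10), Sample(2, "2016-01-05", 10),
        Sample(3, "2016-01-05", 10)))
      val sample2 = sc.parallelize(Seq(
        Sample2(1, "2016-01-05", "A"), Sample2(3, "2016-01-05", "A"),
        Sample2(5, "2016-01-05", "B"), Sample2(6, "2016-01-05", "C"),
        Sample2(7, "2016-01-05", "C")))

      val (found, notFound) = split(sample, sample2)
      found.collect().foreach(println)    // the rows with keys 1, 1 and 3 (order not guaranteed)
      notFound.collect().foreach(println) // the two rows with key 2
    } finally {
      sc.stop()
    }
  }
}
```

Reducing Sample2 to its distinct keys before the join keeps duplicate keys in Sample2 from multiplying the matching Sample rows, which is closer to the EXISTS / NOT EXISTS semantics of the SQL in the question than a plain join.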

Solution 1
0 Accepted 2017-02-10 10:43:26