Spark Scala: delete rows in one RDD based on the columns of another RDD

Question

I am very new to Scala and Spark and am not sure how to start.

I have an RDD that looks like this:

1,2,3,11
2,1,4,12
1,4,5,13
3,5,6,12

And another that looks like this:

2,1
1,2

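I want to filter the first RDD so that it drops every row whose first two columns match a row of the second RDD. The output should look like this: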

 1,4,5,13
 3,5,6,12

Answer 1

// input rdds
val rdd1 = spark.sparkContext.makeRDD(Seq((1,2,3,11), (2,1,4,12), (1,4,5,13), (3,5,6,12)))
val rdd2 = spark.sparkContext.makeRDD(Seq((1,2), (2,1)))

// reshape both RDDs into (key, value) pairs
// the key of the first RDD is a tuple of its first two fields; the value holds the whole record
// the key of the second RDD is the record itself (its two fields); the value is just null
// with matching key types we can join the two RDDs on the key
val rdd1_key = rdd1.map(record => ((record._1, record._2), record))
val rdd2_key = rdd2.map(record => (record, null))

// 1. perform a left outer join; each record becomes (key, (val1, Option[val2]))
// 2. filter, keeping only the records that found no match in rdd2:
//    with no match val2 is None, with a match it is Some(null),
//    wrapping the null we hardcoded in the previous step
// 3. extract val1, the original row
rdd1_key.leftOuterJoin(rdd2_key)
  .filter(record => record._2._2.isEmpty)
  .map(record => record._2._1)
  .collect().foreach(println(_))

// result
(1,4,5,13)
(3,5,6,12)
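
The left outer join followed by the None filter amounts to an anti-join on the first two fields. As a minimal sketch of the same idea, and not part of the original answer, `subtractByKey` from `PairRDDFunctions` removes every pair whose key also appears in the other RDD. The local SparkSession, the app name, and the names `keyed1`, `keyed2`, and `remaining` are assumptions made for illustration; in spark-shell the existing `spark` session would be picked up by `getOrCreate()`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("subtract-by-key-sketch")
  .getOrCreate()

// Same data as in the question
val rdd1 = spark.sparkContext.makeRDD(Seq((1, 2, 3, 11), (2, 1, 4, 12), (1, 4, 5, 13), (3, 5, 6, 12)))
val rdd2 = spark.sparkContext.makeRDD(Seq((2, 1), (1, 2)))

// Key the first RDD by its first two fields and keep the full row as the value
val keyed1 = rdd1.map(r => ((r._1, r._2), r))
// The second RDD only contributes keys, so the value is a placeholder
val keyed2 = rdd2.map(k => (k, ()))

// Drop every pair of keyed1 whose key occurs in keyed2, then recover the rows
val remaining = keyed1.subtractByKey(keyed2).values

remaining.collect().foreach(println)
```

This keeps the keying step from the answer above but replaces the join and filter with a single call. The order in which the surviving rows are printed is not guaranteed.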

Thanks.

Answer 2

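Personally, I prefer the DataFrame/Dataset approach, since they are an optimized form of RDDs, come with more built-in functions, and resemble traditional databases.

Here is the DataFrame approach:

The first step is to convert both RDDs to DataFrames: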

import sqlContext.implicits._
val df1 = rdd1.toDF("col1", "col2", "col3", "col4")
val df2 = rdd2.toDF("col1", "col2")

The second step is to add a new column to the second DataFrame, to be checked in the filter condition:

import org.apache.spark.sql.functions._
val tempdf2 = df2.withColumn("check", lit("check"))

The last step is to join the two DataFrames, then filter out the unneeded rows and drop the unneeded column:

val finalDF = df1.join(tempdf2, Seq("col1", "col2"), "left")
  .filter($"check".isNull)
  .drop($"check")

You should get the final DataFrame as:

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|3   |5   |6   |12  |
|1   |4   |5   |13  |
+----+----+----+----+

Now you can convert it back to an RDD with finalDF.rdd, or keep using the DataFrame itself for further processing.

I hope the answer is helpful.
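
The marker column, left join, and null filter together implement an anti-join. As a hedged alternative that is not part of the original answer, Spark SQL 2.0 and later accept "left_anti" as a join type, which keeps only the left-side rows that have no match on the join columns. The local SparkSession, the app name, and the name `antiDF` are assumptions for illustration; the DataFrames are built directly from a Seq here so that the sketch is self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("left-anti-sketch")
  .getOrCreate()
import spark.implicits._

// Same data and column names as above
val df1 = Seq((1, 2, 3, 11), (2, 1, 4, 12), (1, 4, 5, 13), (3, 5, 6, 12))
  .toDF("col1", "col2", "col3", "col4")
val df2 = Seq((2, 1), (1, 2)).toDF("col1", "col2")

// Keep the rows of df1 whose (col1, col2) pair has no match in df2;
// only the columns of df1 appear in the result
val antiDF = df1.join(df2, Seq("col1", "col2"), "left_anti")

antiDF.show()
```

With this join type there is no helper column to add or drop afterwards.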

Answer 1
1 accepted 2017-10-19 21:48:09

Answer 2
1 2017-10-20 00:39:11
