How to get all the differing records between two Spark RDDs

Question

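I'm very new to Spark and RDDs, so I hope I can explain what I'm after well enough for someone to understand and help :)

I have two very large data sets, say 3 million rows by 50 columns each, stored in Hadoop HDFS. I want to read both into RDDs so the work runs in parallel, and return a third RDD containing every record (from either RDD) that has no match in the other.

Hopefully the following helps show what I want to do... I'm just trying to find all the differing records in the fastest, most efficient way...

The data is not necessarily in the same order: row 1 of rdd1 may be row 4 of rdd2.

Thanks in advance!!

So... this seems to be doing what I want, but it seems too easy to be correct...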

%spark

import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit

//load table1 (sqlContext.sql returns a DataFrame, despite the rdd name).
val rdd1 = sqlContext.sql("select * FROM table1")

//load table2.
val rdd2 = sqlContext.sql("select * FROM table2")

//create the rdd of all misaligned records between table1 and the table2.
//the source tag is added after except(): tagging both inputs first would
//make every row differ, so except() would return everything.
val rdd3 = rdd1.except(rdd2).withColumn("source", lit("tab1")).unionAll(
  rdd2.except(rdd1).withColumn("source", lit("tab2")))

//rdd3.printSchema()    
//val rdd3 = rdd1.except(rdd2)

//drop the temporary table that was used to create a hive compatible table from the last run.
sqlContext.dropTempTable("table3")

//register the new temporary table.
rdd3.toDF().registerTempTable("table3")

//drop the old compare table.
sqlContext.sql("drop table if exists data_base.compare_table")

//create the new version of the s_asset compare table.
sqlContext.sql("create table data_base.compare_table as select * from table3")

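This is the last bit of code I've ended up with so far, and it seems to do the job. I'm not sure about performance on the full data set; fingers crossed...

Many thanks to everyone who took the time to help this poor pleb :)

P.S. If anyone has a more performant solution, I'd love to hear it. Likewise if you can see any problem that might make it return wrong results.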

Answer 1

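Load your two data frames as df1 and df2.
Add a source column with the default values rdd1 and rdd2 respectively.
Union df1 and df2.
Group by "rowid", "name", "status", "lastupdated" and collect the sources as a set.
Keep only the rows that have a single source.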

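import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object OuterJoin {

  def main(args: Array[String]): Unit = {

    // The original obtained the session from a project helper (Constant.getSparkSess);
    // a local session is built here so the example runs on its own.
    val spark = SparkSession.builder().appName("OuterJoin").master("local[*]").getOrCreate()

    import spark.implicits._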

    val cols = Array("rowid", "name", "status", "lastupdated")

    val df1 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "inactive", "31-12-2019"),
      ("1-za23f2", "product3", "inactive", "01-01-2020"),
      ("1-za23f3", "product4", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd1"))

    val df2 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "active", "31-12-2019"),
      ("1-za23f2", "product3", "active", "01-01-2020"),
      ("1-za23f3", "product1", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd2"))

    // Union both sides, group identical rows, and keep those seen in one source only.
    df1.union(df2)
      .groupBy(cols.map(col): _*)
      .agg(collect_set("source").as("sources"))
      .filter(size(col("sources")) === 1)
      .withColumn("from_rdd", explode(col("sources")))
      .drop("sources")
      .show()
  }

}

Answer 2

You can read the data into DataFrames instead of RDDs, then use union and group by to get the result.
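A minimal sketch of the union and group by idea, in case it helps. It is not taken from the answer itself: the object name `UnionGroupByDiff`, the helper `diff`, and the columns `side`, `left_count` and `right_count` are all made up for illustration. It assumes both DataFrames share the same column names and that neither already has a column called `side`. Counting per side, rather than collecting a set, also reports rows that are duplicated a different number of times on each side.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum, when}

object UnionGroupByDiff {

  // Returns each distinct row whose number of occurrences differs between
  // `left` and `right`, together with the count seen on each side.
  // Assumption: same column names in both inputs, no existing "side" column.
  def diff(left: DataFrame, right: DataFrame): DataFrame = {
    val keyCols = left.columns.map(col)

    // Tag every row with the side it came from, then stack the two inputs.
    // union() matches columns by position, so `right` is re-selected in
    // the column order of `left` first.
    val tagged = left.withColumn("side", lit("left"))
      .union(right.select(keyCols: _*).withColumn("side", lit("right")))

    tagged
      .groupBy(keyCols: _*)
      .agg(
        sum(when(col("side") === "left", 1).otherwise(0)).as("left_count"),
        sum(when(col("side") === "right", 1).otherwise(0)).as("right_count"))
      .filter(col("left_count") =!= col("right_count"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnionGroupByDiff").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("1-za23f0", "product1", "active"),
      ("1-za23f1", "product2", "inactive")).toDF("rowid", "name", "status")
    val df2 = Seq(
      ("1-za23f0", "product1", "active"),
      ("1-za23f1", "product2", "active")).toDF("rowid", "name", "status")

    // Only the two versions of 1-za23f1 remain; 1-za23f0 is identical on both sides.
    diff(df1, df2).show(false)
    spark.stop()
  }
}
```

Because groupBy puts null values in the same group, rows that are null in the same columns on both sides count as equal here.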

Answer 3

The two can be joined with "full_outer" and then filtered on a condition that compares the field values on both sides:

val filterCondition = cols
  .map(c => (col(s"l.$c") =!= col(s"r.$c") || col(s"l.$c").isNull || col(s"r.$c").isNull))
  .reduce((acc, c) => acc || c)

df1.alias("l")
  .join(df2.alias("r"), $"l.rowid" === $"r.rowid", "full_outer")
  .where(filterCondition)

Output:

+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|rowid   |name    |status  |lastupdated|source|rowid   |name    |status  |lastupdated|source|
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|1-za23f1|product2|inactive|31-12-2019 |rdd1  |1-za23f1|product2|active  |31-12-2019 |rdd2  |
|1-za23f2|product3|inactive|01-01-2020 |rdd1  |1-za23f2|product3|active  |01-01-2020 |rdd2  |
|1-za23f3|product4|inactive|02-01-2020 |rdd1  |1-za23f3|product1|inactive|02-01-2020 |rdd2  |
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
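One thing to watch with the filter above: a column that is null on both sides is also reported as a difference, because of the isNull checks. If that is not what you want, a possible variant (a sketch only, not part of the original answer) uses Spark's null-safe equality operator `<=>`. The object name `NullSafeJoinDiff` and the helper `diff` are hypothetical, and the sketch assumes the join key is never null.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

object NullSafeJoinDiff {

  // Full outer join on `key`, keeping rows where any compared column differs.
  // `<=>` is null-safe equality: null <=> null is true, and null <=> "x"
  // is false rather than null.
  // Assumption: `key` is never null in either input.
  def diff(df1: DataFrame, df2: DataFrame, key: String, cols: Seq[String]): DataFrame = {
    // Always compare the key as well, so that rows with no partner on the
    // other side (where the key is null after the join) are kept.
    val compared = (key +: cols).distinct
    val anyDiffers: Column = compared
      .map(c => not(col(s"l.$c") <=> col(s"r.$c")))
      .reduce(_ || _)

    df1.alias("l")
      .join(df2.alias("r"), col(s"l.$key") === col(s"r.$key"), "full_outer")
      .where(anyDiffers)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NullSafeJoinDiff").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq[(String, Option[String])](
      ("a", Some("x")), ("b", None), ("c", Some("z"))).toDF("rowid", "status")
    val df2 = Seq[(String, Option[String])](
      ("a", Some("x")), ("b", None), ("d", Some("z"))).toDF("rowid", "status")

    // "a" matches, "b" matches (null on both sides), "c" and "d" have no partner.
    diff(df1, df2, "rowid", Seq("status")).show(false)
    spark.stop()
  }
}
```

With the sample data only the unmatched "c" and "d" rows come back; the "b" row, whose status is null in both inputs, is treated as equal.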


Answer 1
2 2020-06-14 12:56:29

Answer 2
0 2020-06-14 07:25:42

Answer 3
0 2020-06-15 07:29:55
