
How to get all different records in two different Spark RDDs

Very, very new to Spark and RDDs, so I hope I explain what I'm after well enough for someone to understand and help :)

I have two very large sets of data, let's say 3 million rows with 50 columns each, stored in Hadoop HDFS. What I would like to do is read both of these into RDDs so that the work is parallelised, and return a third RDD that contains all records (from either RDD) that do not match.

Hopefully the below helps show what I'm looking to do... I'm just trying to find all the differing records in the fastest, most efficient way...

The data is not necessarily in the same order - row 1 of rdd1 may be row 4 of rdd2.

Many thanks in advance!!

[Screenshot: example datasets and desired result]

So... this seems to be doing what I want it to, but it seems far too easy to be correct...

%spark

import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.lit
import sqlContext.implicits._
import org.apache.spark.sql._

//create the tab1 DataFrame (named rdd1 here, but sqlContext.sql returns a DataFrame).
val rdd1 = sqlContext.sql("select * FROM table1")

//create the tab2 DataFrame.
val rdd2 = sqlContext.sql("select * FROM table2")

//create the set of all misaligned records between table1 and table2, tagged with the
//side each record came from. The source column must be added after except(), otherwise
//it makes every row unique to its own side and except() returns all rows from both tables.
val rdd3 = rdd1.except(rdd2).withColumn("source", lit("tab1"))
  .unionAll(rdd2.except(rdd1).withColumn("source", lit("tab2")))

//rdd3.printSchema()

//drop the temporary table that was used to create a hive compatible table from the last run.
sqlContext.dropTempTable("table3")

//register the new temporary table.
rdd3.toDF().registerTempTable("table3")

//drop the old compare table.
sqlContext.sql("drop table if exists data_base.compare_table")

//create the new version of the s_asset compare table.
sqlContext.sql("create table data_base.compare_table as select * from table3")

This is the final bit of code I've ended up with so far, and it seems to be doing the job - not sure about performance on the full dataset, so I'll keep my fingers crossed...

Many thanks to all who took the time to help this poor pleb out :)

PS: if anyone has a solution with a little more performance, I'd love to hear it. Or if you can see some issue with this that might mean it returns the wrong results.

  1. Load both of your DataFrames as df1 and df2
  2. Add a source column with the default value rdd1 and rdd2 respectively
  3. Union df1 and df2
  4. Group by "rowid", "name", "status", "lastupdated" and collect the sources of each group as a set
  5. Keep only the rows whose source set has a single element, i.e. rows that appear on one side only
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object OuterJoin {

  def main(args: Array[String]): Unit = {

    // Constant.getSparkSess in the original answer is an undefined helper;
    // a plain SparkSession builder is used here instead.
    val spark = SparkSession.builder()
      .appName("OuterJoin")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val cols = Array("rowid", "name", "status", "lastupdated")

    val df1 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "inactive", "31-12-2019"),
      ("1-za23f2", "product3", "inactive", "01-01-2020"),
      ("1-za23f3", "product4", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd1"))

    val df2 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "active", "31-12-2019"),
      ("1-za23f2", "product3", "active", "01-01-2020"),
      ("1-za23f3", "product1", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd2"))

    // union the tagged frames, group on the data columns, and keep only rows
    // whose set of sources has a single element - rows present on one side only
    df1.union(df2)
      .groupBy(cols.map(col): _*)
      .agg(collect_set("source").as("sources"))
      .filter(size(col("sources")) === 1)
      .withColumn("from_rdd", explode(col("sources")))
      .drop("sources")
      .show()
  }

}

You can instead read the data into DataFrames rather than RDDs, and then use union and groupBy to achieve the result - a sketch of the read step follows below.
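
A hedged sketch of that read step, assuming the same Hive tables as in the question and the SparkSession from the answer above (the Parquet path in the comment is purely hypothetical):

import org.apache.spark.sql.functions.lit

// read the Hive tables from the question directly into DataFrames
val df1 = spark.table("table1").withColumn("source", lit("rdd1"))
val df2 = spark.table("table2").withColumn("source", lit("rdd2"))

// files on HDFS could be read the same way, e.g. (hypothetical path):
// val df1 = spark.read.parquet("hdfs:///data/table1").withColumn("source", lit("rdd1"))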

Both DataFrames can also be joined with "full_outer", and then a filter applied that compares each field's value on both sides:

// assumes df1, df2 and cols as defined in the previous answer, plus
// import org.apache.spark.sql.functions.col and spark.implicits._ (for $)

// a row counts as misaligned when any compared column differs, or when the
// row exists on one side only (the isNull checks catch the unmatched side)
val filterCondition = cols
  .map(c => (col(s"l.$c") =!= col(s"r.$c") || col(s"l.$c").isNull || col(s"r.$c").isNull))
  .reduce((acc, c) => acc || c)

df1.alias("l")
  .join(df2.alias("r"), $"l.rowid" === $"r.rowid", "full_outer")
  .where(filterCondition)
  .show(false)

Output:

+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|rowid   |name    |status  |lastupdated|source|rowid   |name    |status  |lastupdated|source|
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|1-za23f1|product2|inactive|31-12-2019 |rdd1  |1-za23f1|product2|active  |31-12-2019 |rdd2  |
|1-za23f2|product3|inactive|01-01-2020 |rdd1  |1-za23f2|product3|active  |01-01-2020 |rdd2  |
|1-za23f3|product4|inactive|02-01-2020 |rdd1  |1-za23f3|product1|inactive|02-01-2020 |rdd2  |
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
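
If a single stacked result (the shape the question asked for) is preferred over these side-by-side rows, the join output can be split back apart. A hypothetical follow-up, not part of the original answer, reusing cols and filterCondition from above:

val joined = df1.alias("l")
  .join(df2.alias("r"), $"l.rowid" === $"r.rowid", "full_outer")
  .where(filterCondition)

// pick each side back out of the joined result, dropping rows that
// were null on that side, then stack them into one tagged frame
val leftRows = joined.where(col("l.rowid").isNotNull)
  .select(cols.map(c => col(s"l.$c")) :+ col("l.source"): _*)
val rightRows = joined.where(col("r.rowid").isNotNull)
  .select(cols.map(c => col(s"r.$c")) :+ col("r.source"): _*)

// one stacked frame of every record that differs, tagged with its side
leftRows.union(rightRows).show(false)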
