

union two RDDs Spark Scala, keeping the right side

I have two Spark DataFrames with the following structure, as read using sqlContext.

 itens.columns (scala command) 
 Array[String] = Array(id_location,id_item, name, price)

 rdd1 
 [1,1,item A,10]
 [1,2,item b,12]
 [1,3,item c,12]

 rdd2
 [1,2,item b,50]
 [1,4,item c,12]
 [1,5,item c,12]

I want the following result, based on the composite key (id_location, id_item):

 [1,1,item A,10]
 [1,2,item b,50]
 [1,3,item c,12]
 [1,4,item c,12]
 [1,5,item c,12]

So, I want a result with distinct items (with respect to the composite key), but when a record with the same key exists in both RDDs, I want to keep only the record from rdd2.

Has anyone had this kind of requirement?

I am working with Spark and Scala.

Best regards, Raphael.

I'm very new to Spark, so there may be a better way of doing this, but could you perhaps map to a pair RDD (based on your composite key), then perform a fullOuterJoin, and keep only the "right" element wherever the result has data for both the "left" and "right" sides?

Rough pseudo code:

val pairRdd1 = rdd1 map {
  line =>
    ((line(0), line(1)), line)   // composite key as a tuple, not a string concatenation
}

val pairRdd2 = rdd2 map {
  line =>
    ((line(0), line(1)), line)
}

val joined = pairRdd1.fullOuterJoin(pairRdd2)

joined map {
  case (id, (left, right)) =>
    right.getOrElse(left.get)    // prefer the right side when both exist
}

If I get time in the morning, I'll try to knock together a working example. Hope that helps!
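The fullOuterJoin-plus-getOrElse logic above can be checked without a cluster using plain Scala Maps. This is only a sketch of the semantics: `fullOuterJoin` below is a hand-rolled stand-in for the RDD method, and the sample data follows the question.

```scala
// Hand-rolled fullOuterJoin over Maps, mirroring the pair-RDD semantics:
// every key from either side appears once, paired with two Options.
def fullOuterJoin[K, A, B](l: Map[K, A], r: Map[K, B]): Map[K, (Option[A], Option[B])] =
  (l.keySet ++ r.keySet).map(k => k -> (l.get(k), r.get(k))).toMap

val left  = Map((1, 1) -> "item A", (1, 2) -> "item b v1", (1, 3) -> "item c")
val right = Map((1, 2) -> "item b v2", (1, 4) -> "item c", (1, 5) -> "item c")

// Keep the right side whenever it exists, otherwise fall back to the left.
val merged = fullOuterJoin(left, right).map { case (k, (l, r)) => k -> r.getOrElse(l.get) }
```

Here `merged` holds five entries, with key (1, 2) taking its value from the right-hand map.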

@Steven has the right idea. You need to map your datasets to key-value pairs and then perform an outer join.

val rdd1 = sc.parallelize(List((1,1,"item A",10),(1,2,"item b",12),(1,3,"item c",12)))
val rdd2 = sc.parallelize(List((1,2,"item b",50),(1,4,"item c",12),(1,5,"item c",12)))

val rdd1KV = rdd1.map{case(id_location,id_item, name, price) => ((id_location, id_item), (name, price))}
val rdd2KV = rdd2.map{case(id_location,id_item, name, price) => ((id_location, id_item), (name, price))}

val joined = rdd1KV.fullOuterJoin(rdd2KV)

val res = joined.map{case((id_location, id_item),(leftOption, rightOption)) =>
    val values = rightOption.getOrElse(leftOption.get)
    (id_location, id_item, values._1, values._2)
}

This will get you the result you are looking for.
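As a local sanity check of the same right-preference rule (plain Scala collections, not the Spark API), note that Map's `++` operator keeps the right operand's value on key collisions, which is exactly the behavior asked for:

```scala
// Sample data from the question, as plain tuples.
val rows1 = List((1, 1, "item A", 10), (1, 2, "item b", 12), (1, 3, "item c", 12))
val rows2 = List((1, 2, "item b", 50), (1, 4, "item c", 12), (1, 5, "item c", 12))

// Key each record by the composite (id_location, id_item).
def byKey(rows: List[(Int, Int, String, Int)]) =
  rows.map { case (loc, item, name, price) => (loc, item) -> (name, price) }.toMap

// Map ++ keeps the right-hand entry when the same key appears on both sides.
val result = byKey(rows1) ++ byKey(rows2)
```

`result` contains five entries, with key (1, 2) carrying the price 50 from rows2.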

Looks like @Steven's answer is logically good but could run into issues if your data doesn't have many intersecting elements (a full outer join can produce a huge data set). You are also using DataFrames, so converting to RDDs and back to DataFrames seems excessive for a task that can be done with the DataFrame API. I'll describe how to do this below.

Let's start with some sample data (taken from your example):

val rdd1 = sc.parallelize(Array((1,1,"item A",10), (1,2,"item b",12), (1,3,"item c",12)))
val rdd2 = sc.parallelize(Array((1,2,"item b",50), (1,4,"item c",12), (1,5,"item c",12)))

Next, we can convert them to DataFrames under separate column aliases. We use different aliases across df1 and df2 here because when we eventually join these two DataFrames, the subsequent select is easier to write (if there were a way to identify the origin of a column after a join, this would be unnecessary). Note that the union of both DataFrames contains the row you want filtered out.

val df1 = rdd1.toDF("id_location", "id_item", "name", "price")
val df2 = rdd2.toDF("id_location_2", "id_item_2", "name_2", "price_2")

// df1.unionAll(df2).show()
// +-----------+-------+------+-----+
// |id_location|id_item|  name|price|
// +-----------+-------+------+-----+
// |          1|      1|item A|   10|
// |          1|      2|item b|   12|
// |          1|      3|item c|   12|
// |          1|      2|item b|   50|
// |          1|      4|item c|   12|
// |          1|      5|item c|   12|
// +-----------+-------+------+-----+

Here, we start by joining the two DataFrames on the key formed by the first two columns of df1 and df2. Then, we create another DataFrame by selecting the rows (essentially from df1) for which a row from df2 with the same join key exists. After that, we run an except on df1 to remove all rows of that previously created DataFrame. This can be seen as a complement, because what we have basically done is delete every row from df1 for which an identical ("id_location", "id_item") exists in df2. Finally, we union the complement with df2 to produce the output DataFrame.

val df_joined = df1.join(df2, (df1("id_location") === df2("id_location_2")) && (df1("id_item") === df2("id_item_2")))
val df1_common_keyed = df_joined.select($"id_location", $"id_item", $"name", $"price")
val df1_complement = df1.except(df1_common_keyed)
val df_union = df1_complement.unionAll(df2)

// df_union.show()
// +-----------+-------+------+-----+
// |id_location|id_item|  name|price|
// +-----------+-------+------+-----+
// |          1|      3|item c|   12|
// |          1|      1|item A|   10|
// |          1|      2|item b|   50|
// |          1|      4|item c|   12|
// |          1|      5|item c|   12|
// +-----------+-------+------+-----+

Again, as @Steven suggested, you could use the RDD API by converting your DataFrames to RDDs and working with those. If that's what you want to do, the following is another way of accomplishing it, using subtractByKey() and the input RDDs from above:

val keyed1 = rdd1.keyBy { case (id_location, id_item, _, _) => (id_location, id_item) }
val keyed2 = rdd2.keyBy { case (id_location, id_item, _, _) => (id_location, id_item) }
val unionRDD = keyed1.subtractByKey(keyed2).values.union(rdd2)

// unionRDD.collect().foreach(println)
// (1,1,item A,10)
// (1,3,item c,12)
// (1,2,item b,50)
// (1,4,item c,12)
// (1,5,item c,12)
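The subtractByKey step can be mimicked on plain collections to check the logic locally (a sketch over the same sample data, not the Spark API): drop every rows1 entry whose composite key appears in rows2, then append rows2 unchanged.

```scala
// Sample data from the question, as plain tuples.
val rows1 = List((1, 1, "item A", 10), (1, 2, "item b", 12), (1, 3, "item c", 12))
val rows2 = List((1, 2, "item b", 50), (1, 4, "item c", 12), (1, 5, "item c", 12))

// Composite keys present on the right-hand side.
val keys2 = rows2.map { case (loc, item, _, _) => (loc, item) }.toSet

// "subtractByKey": keep only rows1 entries whose key is absent from rows2,
// then union with rows2.
val unionRows = rows1.filterNot { case (loc, item, _, _) => keys2((loc, item)) } ++ rows2
```

The duplicate-keyed row (1, 2, "item b", 12) from rows1 is dropped in favor of the rows2 version.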
