
Union two RDDs in Spark/Scala, keeping the right side

I have two Spark DataFrames with the following structure, read in via sqlContext.

 itens.columns (scala command) 
 Array[String] = Array(id_location,id_item, name, price)

 rdd1 
 [1,1,item A,10]
 [1,2,item b,12]
 [1,3,item c,12]

 rdd2
 [1,2,item b,50]
 [1,4,item c,12]
 [1,5,item c,12]

I want the following result, based on the composite key (id_location, id_item):

 [1,1,item A,10]
 [1,2,item b,50]
 [1,3,item c,12]
 [1,4,item c,12]
 [1,5,item c,12]

So, I want a result with distinct items (with respect to the composite key), but when a record with the same key exists in both RDDs, I want to keep only the record from rdd2.

Has anyone dealt with this kind of requirement?

I am working with spark and scala.

Best Regards, Raphael.

I'm very new to Spark, so there may be a better way of doing this, but could you perhaps map to a pair RDD (keyed on your composite key), then perform a fullOuterJoin, keeping only the "right" element wherever there is data for both the "left" and "right" sides?

Rough pseudo code:

val pairRdd1 = rdd1.map { line =>
  ((line(0), line(1)), line)  // key on the composite (id_location, id_item)
}

val pairRdd2 = rdd2.map { line =>
  ((line(0), line(1)), line)
}

val joined = pairRdd1.fullOuterJoin(pairRdd2)

val result = joined.map { case (key, (left, right)) =>
  right.getOrElse(left.get)  // prefer rdd2's record when both sides exist
}

Note the tuple key: concatenating line(0) + line(1) into one value could make distinct keys collide (e.g. (1, 12) vs (11, 2)), so keeping them as a pair is safer.

If I get time in the morning, I'll try and knock together a working example. Hope that helps!
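In the meantime, the fullOuterJoin semantics above can be sketched with plain Scala Maps, no Spark required, to see why the Option handling works. The fullOuterJoin helper below is just an illustration of the behavior, not a Spark API:

```scala
// Sketch of fullOuterJoin semantics on plain Maps: for every key present on
// either side, pair up Options; a key missing on one side yields None there.
def fullOuterJoin[K, V](left: Map[K, V], right: Map[K, V]): Map[K, (Option[V], Option[V])] =
  (left.keySet ++ right.keySet).map(k => (k, (left.get(k), right.get(k)))).toMap

val left  = Map((1, 1) -> "item A", (1, 2) -> "item b")
val right = Map((1, 2) -> "item b v2", (1, 4) -> "item c")

// Prefer the right value whenever both sides have the key, as in the answer above.
val resolved = fullOuterJoin(left, right).map { case (k, (l, r)) => (k, r.getOrElse(l.get)) }
```

The `getOrElse(left.get)` is safe because every joined key comes from at least one side: when `right` is `None`, `left` must be `Some`.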

@Steven has the right idea. You need to map your datasets to key-value pairs and then perform a full outer join.

val rdd1 = sc.parallelize(List((1,1,"item A",10),(1,2,"item b",12),(1,3,"item c",12)))
val rdd2 = sc.parallelize(List((1,2,"item b",50),(1,4,"item c",12),(1,5,"item c",12)))

val rdd1KV = rdd1.map{case(id_location,id_item, name, price) => ((id_location, id_item), (name, price))}
val rdd2KV = rdd2.map{case(id_location,id_item, name, price) => ((id_location, id_item), (name, price))}

val joined = rdd1KV.fullOuterJoin(rdd2KV)

val res = joined.map{case((id_location, id_item),(leftOption, rightOption)) =>
    val values = rightOption.getOrElse(leftOption.get)
    (id_location, id_item, values._1, values._2)
}

This will get you the result you are looking for.
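As a sanity check for the merge semantics (right side wins on a key collision), plain Scala Maps behave the same way under ++, which is a minimal sketch of the intent rather than Spark code:

```scala
// Key both datasets by (id_location, id_item); Map's ++ keeps the right-hand
// value whenever the same key appears in both maps, mirroring the join logic.
val m1 = Map((1, 1) -> ("item A", 10), (1, 2) -> ("item b", 12), (1, 3) -> ("item c", 12))
val m2 = Map((1, 2) -> ("item b", 50), (1, 4) -> ("item c", 12), (1, 5) -> ("item c", 12))
val merged = m1 ++ m2
// merged((1, 2)) is ("item b", 50): the rdd2 record replaced the rdd1 one.
```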

@Steven's answer looks logically sound, but it could run into issues if your data doesn't have many intersecting elements (i.e., the full outer join would produce a huge data set). You are also using DataFrames, so converting to RDDs and then back to DataFrames seems excessive for a task that can be done with the DataFrames API. I'll describe how below.

Let's start with some sample data (taken from your example):

val rdd1 = sc.parallelize(Array((1,1,"item A",10), (1,2,"item b",12), (1,3,"item c",12)))
val rdd2 = sc.parallelize(Array((1,2,"item b",50), (1,4,"item c",12), (1,5,"item c",12)))

Next, we can convert them to DataFrames under separate column aliases. We use different aliases across df1 and df2 because, when we eventually join these two DataFrames, the subsequent select is easier to write (if there were a way to identify the origin of a column after a join, this would be unnecessary). Note that the union of both DataFrames contains the row you want filtered out.

val df1 = rdd1.toDF("id_location", "id_item", "name", "price")
val df2 = rdd2.toDF("id_location_2", "id_item_2", "name_2", "price_2")

// df1.unionAll(df2).show()
// +-----------+-------+------+-----+
// |id_location|id_item|  name|price|
// +-----------+-------+------+-----+
// |          1|      1|item A|   10|
// |          1|      2|item b|   12|
// |          1|      3|item c|   12|
// |          1|      2|item b|   50|
// |          1|      4|item c|   12|
// |          1|      5|item c|   12|
// +-----------+-------+------+-----+

Here, we start by joining the two DataFrames on the composite key, i.e., the first two columns of df1 and df2. Then, we create another DataFrame by selecting the rows (essentially from df1) for which a row in df2 has the same join key. After that, we run an except on df1 to remove all rows in that previously created DataFrame. This can be seen as taking a complement, because what we have effectively done is delete every row from df1 whose ("id_location", "id_item") also appears in df2. Finally, we union the complement with df2 to produce the output DataFrame.

val df_joined = df1.join(df2, (df1("id_location") === df2("id_location_2")) && (df1("id_item") === df2("id_item_2")))
val df1_common_keyed = df_joined.select($"id_location", $"id_item", $"name", $"price")
val df1_complement = df1.except(df1_common_keyed)
val df_union = df1_complement.unionAll(df2)

// df_union.show()
// +-----------+-------+------+-----+
// |id_location|id_item|  name|price|
// +-----------+-------+------+-----+
// |          1|      3|item c|   12|
// |          1|      1|item A|   10|
// |          1|      2|item b|   50|
// |          1|      4|item c|   12|
// |          1|      5|item c|   12|
// +-----------+-------+------+-----+

Again, as @Steven suggested, you could use the RDD API by converting your DataFrames to RDDs and working with those. If that's what you want to do, the following is another way of accomplishing it, using subtractByKey() and the input RDDs from above:

val keyed1 = rdd1.keyBy { case (id_location, id_item, _, _) => (id_location, id_item) }
val keyed2 = rdd2.keyBy { case (id_location, id_item, _, _) => (id_location, id_item) }
val unionRDD = keyed1.subtractByKey(keyed2).values.union(rdd2)

// unionRDD.collect().foreach(println)
// (1,1,item A,10)
// (1,3,item c,12)
// (1,2,item b,50)
// (1,4,item c,12)
// (1,5,item c,12)
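The same subtract-then-union logic can be mirrored on plain Scala collections, which makes the behavior easy to verify without a Spark cluster. This is a sketch with the sample rows, not the RDD API:

```scala
val rows1 = Seq((1, 1, "item A", 10), (1, 2, "item b", 12), (1, 3, "item c", 12))
val rows2 = Seq((1, 2, "item b", 50), (1, 4, "item c", 12), (1, 5, "item c", 12))

// Keys present on the right side; rows1 entries carrying these keys are
// dropped (what subtractByKey does), then the right side is appended (union).
val keys2  = rows2.map { case (loc, item, _, _) => (loc, item) }.toSet
val result = rows1.filterNot { case (loc, item, _, _) => keys2((loc, item)) } ++ rows2
```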
