
Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs that share some common fields. Based on those fields, I want to get the unmatched records from RDD1 or RDD2: records present in RDD1 but not in RDD2, and records present in RDD2 but not in RDD1.

It seems we could use subtract or subtractByKey.

Sample Input:

**File 1:**

sam,23,cricket
alex,34,football
ann,21,football

**File 2:**

ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa

**Expected output:**

sam,23,cricket

Update:

Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query to get the unmatched records).

What I am looking for is whether we can do this with Spark Core itself instead of Spark SQL. I am not looking for code; is there an operation available in Spark Core for this?

Please advise on this.

Regards, Shankar.

You could bring both RDDs to the same shape and use subtract to remove the common elements.

Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:

val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}

val in1andNotin2 = rdd1 subtract userScore2

val in2andNotIn1 = userScore2 subtract rdd1
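For completeness, a minimal end-to-end sketch. This assumes the sample data lives at hypothetical paths `file1.csv` and `file2.csv`, and that a `SparkContext` named `sc` is already available (as in the Spark shell):

```scala
// Parse each line into a tuple so both RDDs have the same shape.
val rdd1 = sc.textFile("file1.csv").map { line =>
  val Array(name, age, sport) = line.split(",")
  (name, age, sport)
}

val rdd2 = sc.textFile("file2.csv").map { line =>
  val Array(name, age, sport, country) = line.split(",")
  (name, age, sport) // drop the extra country column
}

// Records in rdd1 but not in rdd2, and vice versa.
val in1NotIn2 = rdd1.subtract(rdd2) // (sam,23,cricket)
val in2NotIn1 = rdd2.subtract(rdd1) // e.g. (ruby,25,football)

in1NotIn2.collect().foreach(println)
```

Note that `subtract` compares whole elements, so both RDDs must be mapped to the same tuple shape first; if the records were keyed pairs `(key, value)`, `subtractByKey` would remove elements by key alone regardless of the value.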
