
Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs that share some common fields. Based on those fields, I want to get the unmatched records from RDD1 or RDD2: records present in RDD1 but not in RDD2, and records present in RDD2 but not in RDD1.

It seems we could use subtract or subtractByKey.

Sample Input:

**File 1:**

sam,23,cricket
alex,34,football
ann,21,football

**File 2:**

ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa

**Expected output:**

sam,23,cricket

Update:

Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query to get the unmatched records).

What I am looking for is whether we can do this with Spark Core itself instead of Spark SQL. I am not looking for code; is there an operation available in Spark Core for this?

Please advise on this.

Regards, Shankar.

You could bring both RDDs to the same shape and use subtract to remove the common elements.

Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:

val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}

val in1andNotin2 = rdd1 subtract userScore2

val in2andNotIn1 = userScore2 subtract rdd1
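For completeness, a minimal end-to-end sketch. This assumes the sample data lives at hypothetical paths `file1.csv` and `file2.csv`, and that a `SparkContext` named `sc` is already available (as in the Spark shell):

```scala
// Parse each line into a tuple so both RDDs have the same shape.
val rdd1 = sc.textFile("file1.csv").map { line =>
  val Array(name, age, sport) = line.split(",")
  (name, age, sport)
}

val rdd2 = sc.textFile("file2.csv").map { line =>
  val Array(name, age, sport, country) = line.split(",")
  (name, age, sport) // drop the extra country column
}

// Records in rdd1 but not in rdd2, and vice versa.
val in1NotIn2 = rdd1.subtract(rdd2) // (sam,23,cricket)
val in2NotIn1 = rdd2.subtract(rdd1) // e.g. (ruby,25,football)

in1NotIn2.collect().foreach(println)
```

Note that `subtract` compares whole elements, so both RDDs must be mapped to the same tuple shape first; if the records were keyed pairs `(key, value)`, `subtractByKey` would remove elements by key alone regardless of the value.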
