
Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs; each RDD has some common fields. Based on those fields, I want to get the unmatched records from RDD1 or RDD2: records available in RDD1 but not in RDD2, and records available in RDD2 but not in RDD1.

It seems we could use subtract or subtractByKey.

Sample Input:

**File 1:**

sam,23,cricket
alex,34,football
ann,21,football

**File 2:**

ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa

**expected output:**

sam,23,cricket
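The expected output keeps only the row whose name/age key appears in neither file's counterpart, so one way to read it is a set difference keyed on the common (name, age) fields, in the spirit of subtractByKey. A minimal sketch of that semantics with plain Scala collections (no Spark involved; the object name is just for illustration):

```scala
object SubtractSketch {
  def main(args: Array[String]): Unit = {
    // The sample rows from File 1 and File 2 above
    val file1 = Seq(("sam", 23, "cricket"), ("alex", 34, "football"), ("ann", 21, "football"))
    val file2 = Seq(("ruby", 25, "football", "usa"), ("alex", 34, "cricket", "usa"), ("ann", 21, "cricket", "usa"))

    // Key both sides on the common (name, age) fields, like subtractByKey would
    val keys2 = file2.map { case (name, age, _, _) => (name, age) }.toSet
    val in1NotIn2 = file1.filterNot { case (name, age, _) => keys2((name, age)) }

    println(in1NotIn2)   // List((sam,23,cricket))
  }
}
```

Note that a plain subtract on the full (name, age, sport) tuples would keep all three File 1 rows, because alex and ann have different sports in the two files; only keying on the shared fields reproduces the expected output.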

Update:

Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query that returns the unmatched records).

What I am looking for is whether we can do this with Spark Core itself instead of Spark SQL. I am not looking for code; is there an operation available in Spark Core for this?

Please advise on this.

Regards, Shankar.

You could bring both RDDs to the same shape and use subtract to remove the common elements.

Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:

// Drop the extra country field so rdd2 rows match rdd1's (name, age, sport) shape
val userScore2 = rdd2.map { case (name, age, sport, country) => (name, age, sport) }

// records in rdd1 with no exact match in rdd2
val in1andNotIn2 = rdd1 subtract userScore2

// records in rdd2 with no exact match in rdd1
val in2andNotIn1 = userScore2 subtract rdd1
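For completeness, an end-to-end sketch of this approach with Spark Core only. This assumes spark-core is on the classpath, a local master, and hypothetical file paths; it is an illustration of the shape of the job, not a tested production program. Note that subtract compares whole tuples, so rows that share a name but differ in sport count as unmatched; to match on the key fields only, you would key both RDDs (e.g. with keyBy) and use subtractByKey instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnmatchedRows {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("unmatched-rows").setMaster("local[*]"))

    // Parse "name,age,sport" lines from file1 (path is a placeholder)
    val rdd1 = sc.textFile("file1.txt").map { line =>
      val Array(name, age, sport) = line.split(",")
      (name, age.toInt, sport)
    }

    // Parse "name,age,sport,country" lines from file2 (path is a placeholder)
    val rdd2 = sc.textFile("file2.txt").map { line =>
      val Array(name, age, sport, country) = line.split(",")
      (name, age.toInt, sport, country)
    }

    // Project rdd2 down to rdd1's shape, then take the difference both ways
    val userScore2 = rdd2.map { case (name, age, sport, _) => (name, age, sport) }
    val in1andNotIn2 = rdd1.subtract(userScore2)
    val in2andNotIn1 = userScore2.subtract(rdd1)

    in1andNotIn2.collect().foreach(println)
    in2andNotIn1.collect().foreach(println)
    sc.stop()
  }
}
```

Both subtract and subtractByKey are plain RDD operations, so this stays entirely within Spark Core, as the question asks.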
