
Perform Set Difference on RDDs in Spark Python

I have two Spark RDDs: A has 301,500,000 rows and B has 1,500,000 rows. All 1.5 million rows in B also appear in A. I would like the set difference of the two RDDs, so that I am left with A containing 300,000,000 rows, with the 1,500,000 rows from B no longer present.

I cannot use Spark DataFrames.

Here is the approach I am using right now. These RDDs have primary keys. Below, I create a (collected) list of the primary keys that appear in B, then iterate through the primary keys of A to find those that do not appear in the list of B's primary keys.

a = sc.parallelize([[0, 'foo', 'a'], [1, 'bar', 'b'], [2, 'mix', 'c'],
                    [3, 'hem', 'd'], [4, 'line', 'e']])
b = sc.parallelize([[1, 'bar', 'b'], [2, 'mix', 'c']])

# Collect B's primary keys to the driver (first column = primary key).
b_primary_keys = b.map(lambda x: x[0]).collect()


def sep_a_and_b(row):
    # Keep the row only if its primary key does not appear in B.
    primary_key = row[0]
    if primary_key not in b_primary_keys:
        return row


a_minus_b = a.map(sep_a_and_b).filter(lambda x: x is not None)
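
For reference, on the toy data above this leaves the three rows whose keys are absent from B:

print(a_minus_b.collect())
# [[0, 'foo', 'a'], [3, 'hem', 'd'], [4, 'line', 'e']]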

Now, this works for the sample problem because A and B are tiny. However, it is unsuccessful when I use my true datasets A and B. Is there a better (more parallel) way to implement this?

This seems like something you can solve with subtractByKey:

filteredA = a.subtractByKey(b)  # both RDDs must be (key, value) pairs; see below

To convert each row to a (key, value) pair:

keyValRDD = rdd.map(lambda x: (x[0], x[1:]))  # the key must be hashable, so use x[0] rather than the list slice x[:1]

*Note that my Python is weak and there might be better ways to split the values.
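
Putting the two steps together on the question's sample data, a minimal end-to-end sketch might look like this (the variable names a_kv and b_kv are mine, and the final map just reassembles the original row layout):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

a = sc.parallelize([[0, 'foo', 'a'], [1, 'bar', 'b'], [2, 'mix', 'c'],
                    [3, 'hem', 'd'], [4, 'line', 'e']])
b = sc.parallelize([[1, 'bar', 'b'], [2, 'mix', 'c']])

# Convert both RDDs to (key, value) pairs keyed on the primary key.
a_kv = a.map(lambda x: (x[0], x[1:]))
b_kv = b.map(lambda x: (x[0], x[1:]))

# subtractByKey drops every pair in a_kv whose key also occurs in b_kv.
# It runs as a distributed shuffle, so nothing is collected to the driver.
a_minus_b = a_kv.subtractByKey(b_kv).map(lambda kv: [kv[0]] + kv[1])

print(a_minus_b.collect())
# e.g. [[0, 'foo', 'a'], [3, 'hem', 'd'], [4, 'line', 'e']] (order may vary)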
