Does Spark's distinct() function shuffle only the distinct tuples from each partition?
As I understand it, distinct() hash-partitions the RDD to identify the unique keys. But does it optimize by moving only the distinct tuples per partition?
Imagine an RDD with the following partitions:
On a distinct() over this RDD, would all the duplicate keys (the 2s in partition 1 and the 5s in partition 2) get shuffled to their target partition, or would only the distinct keys per partition get shuffled to the target?
If all keys get shuffled, then an aggregate() with set() operations would reduce the shuffle:
def set_update(u, v):
    u.add(v)
    return u

rdd.aggregate(set(), set_update, lambda u1, u2: u1 | u2)
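Since the example partition contents were omitted above, here is a plain-Python sketch (no Spark needed) of what aggregate() with set_update computes; the partition values are made up to match the duplicates described in the question (2s in partition 1, 5s in partition 2):

```python
def set_update(u, v):
    u.add(v)
    return u

# Hypothetical partition contents, matching the duplicates described above.
part1 = [1, 2, 2, 3]
part2 = [4, 5, 5, 6]

# aggregate(): run the seqOp within each partition, then combine the partials.
u1 = set()
for v in part1:
    u1 = set_update(u1, v)
u2 = set()
for v in part2:
    u2 = set_update(u2, v)

# combOp: union the per-partition sets, as lambda u1, u2: u1 | u2 would.
combined = u1 | u2
print(sorted(combined))  # [1, 2, 3, 4, 5, 6]
```

Only the per-partition sets (already deduplicated) cross the network in this scheme, which is the saving the question is after.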
distinct is implemented via reduceByKey on (element, None) pairs, so it shuffles only the unique values per partition. If the number of duplicates is low, it is still quite an expensive operation though.
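To make that mechanism concrete, here is a plain-Python emulation of the (element, None) / reduceByKey pattern; the function name is mine, not Spark's, and a dict stands in for the shuffle:

```python
def distinct_via_reduce_by_key(elements):
    # Step 1: map each element x to an (x, None) pair.
    pairs = ((x, None) for x in elements)
    # Step 2: reduceByKey with lambda a, b: a -- only one entry survives per
    # key, which is why only unique values per partition need to be shuffled.
    reduced = {}
    for k, v in pairs:
        reduced.setdefault(k, v)
    # Step 3: map the (x, None) pairs back to bare elements.
    return list(reduced)

print(sorted(distinct_via_reduce_by_key([2, 2, 3, 5, 5, 6])))  # [2, 3, 5, 6]
```

In Spark the per-key reduction runs map-side first, so duplicates within a partition are collapsed before the shuffle, matching the answer above.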
There are situations when using set can be useful. In particular, if you call distinct on a PairwiseRDD you may prefer aggregateByKey / combineByKey instead, to achieve both deduplication and partitioning by key at the same time. In particular, consider the following code:
rdd1 = sc.parallelize([("foo", 1), ("foo", 1), ("bar", 1)])
rdd2 = sc.parallelize([("foo", "x"), ("bar", "y")])
rdd1.distinct().join(rdd2)
It has to shuffle rdd1 twice: once for distinct and once for join. Instead you can use aggregateByKey:
def flatten(kvs):
    (key, (left, right)) = kvs
    for v in left:
        yield (key, (v, right))

aggregated = (rdd1
    .aggregateByKey(set(), set_update, lambda u1, u2: u1 | u2))

rdd2_partitioned = rdd2.partitionBy(aggregated.getNumPartitions())

(aggregated.join(rdd2_partitioned)
    .flatMap(flatten))
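To see what flatten actually emits, here it is applied to one record of the shape produced by aggregated.join(rdd2_partitioned) in plain Python; the sample record is illustrative:

```python
def flatten(kvs):
    (key, (left, right)) = kvs
    for v in left:
        yield (key, (v, right))

# One joined record: the key, the deduplicated set from rdd1,
# and the matching value from rdd2.
record = ("foo", ({1}, "x"))
print(list(flatten(record)))  # [('foo', (1, 'x'))]
```

Each element of the set is paired back with the joined value, restoring the shape that rdd1.distinct().join(rdd2) would have produced.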
Note: the join logic is a little bit different in Scala than in Python (PySpark uses union followed by groupByKey; see "Spark RDD groupByKey + join vs join performance" for the Python and Scala DAGs), hence we have to manually partition the second RDD before we call join.