
Does spark's distinct() function shuffle only the distinct tuples from each partition?

As I understand it, distinct() hash-partitions the RDD to identify the unique keys. But does it optimize by moving only the distinct tuples per partition?

Imagine an RDD with the following partitions:

  1. [1, 2, 2, 1, 4, 2, 2]
  2. [1, 3, 3, 5, 4, 5, 5, 5]

On a distinct() over this RDD, would all the duplicate keys (the 2s in partition 1 and the 5s in partition 2) get shuffled to their target partition, or would only the distinct keys per partition get shuffled to the target?
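A quick way to inspect the per-partition layout is glom(); a minimal sketch, assuming an active SparkContext sc and letting parallelize split the data into two partitions:

rdd = sc.parallelize([1, 2, 2, 1, 4, 2, 2, 1, 3, 3, 5, 4, 5, 5, 5], 2)
rdd.glom().collect()
# [[1, 2, 2, 1, 4, 2, 2], [1, 3, 3, 5, 4, 5, 5, 5]] with the default slicing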

If all keys get shuffled, then an aggregate() with set() operations would reduce the shuffle:

def set_update(u, v):
    # Add one element to the running per-partition set.
    u.add(v)
    return u

rdd.aggregate(set(), set_update, lambda u1, u2: u1 | u2)
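Note that, unlike distinct(), aggregate() brings the result back to the driver as a plain Python set rather than returning an RDD. A quick check, reusing the sample rdd from the sketch above and the set_update function just defined (unique_values is named here only for illustration):

unique_values = rdd.aggregate(set(), set_update, lambda u1, u2: u1 | u2)
print(unique_values)
# {1, 2, 3, 4, 5}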

distinct is implemented via reduceByKey on (element, None) pairs, so it shuffles only the unique values from each partition. If the number of duplicates is low it is still quite an expensive operation though.
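In PySpark this boils down to roughly the following (a sketch of the idea rather than the exact library code):

def distinct_sketch(rdd, num_partitions=None):
    # Pair each element with a dummy value, let reduceByKey's map-side combine
    # keep one pair per distinct element per partition before the shuffle,
    # then drop the dummy value again.
    return (rdd
        .map(lambda x: (x, None))
        .reduceByKey(lambda x, _: x, num_partitions)
        .map(lambda kv: kv[0]))

Because of the map-side combine, each partition sends at most one record per distinct value across the network, which is why only the unique values per partition are shuffled.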

There are situations when using a set can be useful. In particular, if you call distinct on a pairwise RDD you may prefer aggregateByKey / combineByKey instead, to achieve both deduplication and partitioning by key at the same time. Consider the following code:

rdd1 = sc.parallelize([("foo", 1), ("foo", 1), ("bar", 1)])
rdd2 = sc.parallelize([("foo", "x"), ("bar", "y")])
rdd1.distinct().join(rdd2)

It has to shuffle rdd1 twice: once for distinct and once for join. Instead you can use aggregateByKey:

def flatten(kvs):
    # Unpack (key, (set_of_values, right_value)) back into flat (key, (value, right_value)) pairs.
    (key, (left, right)) = kvs
    for v in left:
        yield (key, (v, right))

# Deduplicate rdd1 and hash-partition it by key in a single shuffle.
aggregated = (rdd1
    .aggregateByKey(set(), set_update, lambda u1, u2: u1 | u2))

# Give rdd2 the same partitioning so the join does not shuffle aggregated again.
rdd2_partitioned = rdd2.partitionBy(aggregated.getNumPartitions())

(aggregated.join(rdd2_partitioned)
    .flatMap(flatten))
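For reference, collecting the result with the sample data above gives the same pairs as rdd1.distinct().join(rdd2), just without the second shuffle of rdd1 (joined is named here only for illustration):

joined = aggregated.join(rdd2_partitioned).flatMap(flatten)
joined.collect()
# [('foo', (1, 'x')), ('bar', (1, 'y'))]  -- order is not guaranteed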

Note:

The join logic is a little different in Scala than in Python (PySpark uses union followed by groupByKey; see Spark RDD groupByKey + join vs join performance for Python and Scala DAGs), hence we have to manually partition the second RDD before we call join.
