Apache Spark: Join two RDDs with different partitioners

I have two RDDs with different partitioners.

case class Person(name: String, age: Int, school: String)
case class School(name: String, address: String)

rdd1 is the RDD of Person, which I have partitioned by the person's age, and then re-keyed by school.

val rdd1: RDD[(String, Person)] = persons   // persons: RDD[Person]
                            .keyBy(_.age)
                            .partitionBy(new HashPartitioner(10))
                            .mapPartitions(iter =>
                                 iter.map { case (_, person) =>
                                    (person.school, person)
                            })

rdd2 is the RDD of School, grouped by the school's name.

val rdd2: RDD[(String, Iterable[School])] = schools.groupBy(_.name)   // schools: RDD[School]

Now, rdd1 is partitioned by the person's age, so all persons with the same age go to the same partition. And rdd2 is partitioned (by default) by the name of the school.

I want to rdd1.leftOuterJoin(rdd2) in such a way that rdd1 doesn't get shuffled, because rdd1 is very big compared to rdd2. Also, I'm writing the result to Cassandra, which is partitioned on age, so the current partitioning of rdd1 will speed up the write later.

Is there a way to join these two RDDs without: 1. shuffling rdd1, and 2. broadcasting rdd2 (rdd2 is bigger than the available memory)?

Note: the joined RDD should be partitioned based on age.

Suppose you have two RDDs, rdd1 and rdd2, and you want to apply a join operation. If the RDDs are partitioned (a partitioner is set), then calling rdd3 = rdd1.join(rdd2) makes rdd3 partitioned by rdd1's partitioner: rdd3 always takes the partitioner from rdd1 (the first parent, the one join was called on).
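Building on this: one common way to meet the question's constraints is to repartition only rdd2 with rdd1's partitioner, so the join is partitioner-aware and rdd1's data stays in place while only the smaller rdd2 is shuffled. A minimal sketch of this idea (the input collections, object name, and local SparkContext are illustrative, not from the question):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinPartitionerSketch {
  case class Person(name: String, age: Int, school: String)
  case class School(name: String, address: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("join-partitioner-sketch"))
    val partitioner = new HashPartitioner(10)

    // rdd1: keyed by school and explicitly partitioned (stands in for the big RDD).
    val rdd1 = sc
      .parallelize(Seq(Person("alice", 10, "north"), Person("bob", 11, "south")))
      .map(p => (p.school, p))
      .partitionBy(partitioner)

    // rdd2: keyed by school name and given the SAME partitioner instance, so the
    // join can line partitions up one-to-one -- only rdd2 is shuffled here.
    val rdd2 = sc
      .parallelize(Seq(School("north", "1 North St")))
      .map(s => (s.name, s))
      .partitionBy(partitioner)

    val joined = rdd1.leftOuterJoin(rdd2)
    // The joined RDD inherits the partitioner of its first parent, rdd1.
    assert(joined.partitioner.contains(partitioner))
    sc.stop()
  }
}
```

Note that after the join the data is co-partitioned by school (the join key), not by age; recovering an age-based layout for the Cassandra write would require a further keyBy/partitionBy on age, which does shuffle the joined result.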
