Apache Spark: Join two RDDs with different partitioners
I have 2 RDDs with different partitioners.
case class Person(name: String, age: Int, school: String)
case class School(name: String, address: String)
rdd1 is the RDD of Person, which I have partitioned based on the age of the person, and then converted the key to school.
val rdd1: RDD[(String, Person)] = persons.keyBy(_.age)   // RDD[(Int, Person)]
  .partitionBy(new HashPartitioner(10))                  // co-locate persons with the same age
  .mapPartitions(iter =>
    iter.map { case (_, person) =>
      (person.school, person)                            // re-key by school
    })
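One detail worth checking in the snippet above (a sketch, assuming a SparkContext `sc` and a source RDD `persons: RDD[Person]`, both hypothetical names): re-keying after `partitionBy` discards the partitioner, because Spark cannot know that the new keys still line up with the old partitioning, so the resulting RDD reports no partitioner at all:

```scala
import org.apache.spark.HashPartitioner

val byAge = persons.keyBy(_.age).partitionBy(new HashPartitioner(10))
println(byAge.partitioner)     // Some(HashPartitioner) with 10 partitions

// map (and mapPartitions without preservesPartitioning = true) drops the
// partitioner, and re-keying by school would invalidate the age-based
// HashPartitioner anyway, since school hashes differently from age.
val bySchool = byAge.map { case (_, p) => (p.school, p) }
println(bySchool.partitioner)  // None
```

This matters for the question below: after the re-keying step, Spark no longer considers rdd1 partitioned at all, so a subsequent join will shuffle it regardless.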
rdd2 is the RDD of School, grouped by the name of the school.
val rdd2: RDD[(String, Iterable[School])] = schools.groupBy(_.name)
Now, rdd1 is partitioned based on the age of the person, so all persons with the same age go to the same partition. And rdd2 is partitioned (by default) based on the name of the school.
I want to do rdd1.leftOuterJoin(rdd2) in such a way that rdd1 doesn't get shuffled, because rdd1 is very, very big compared to rdd2. Also, I'm writing the result to Cassandra, which is partitioned on age, so the current partitioning of rdd1 will speed up the write later.
Is there a way to join these two RDDs without: 1. shuffling rdd1, and 2. broadcasting rdd2 (since rdd2 is bigger than the available memory)?
Note: The joined RDD should be partitioned based on age.
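Since the join key here is school, a join that avoids shuffling rdd1 requires both sides to be co-partitioned by school. A minimal sketch of that approach, assuming hypothetical source RDDs `persons: RDD[Person]` and `schools: RDD[School]` (note this partitions by school, not age, so the age-based layout wanted for the Cassandra write is not preserved through the join itself):

```scala
import org.apache.spark.HashPartitioner

val schoolPartitioner = new HashPartitioner(10)

// Partition the big RDD once by the join key and cache it, so the
// leftOuterJoin below reuses the existing layout instead of re-shuffling it.
val bigBySchool = persons.keyBy(_.school)
  .partitionBy(schoolPartitioner)
  .cache()

// Give the small RDD the same partitioner; only this side gets shuffled.
val smallByName = schools.keyBy(_.name).partitionBy(schoolPartitioner)

// Co-partitioned join: narrow dependency on bigBySchool, no shuffle of it.
val joined = bigBySchool.leftOuterJoin(smallByName)
```

Repartitioning by age afterwards (or sorting within Cassandra partitions) would then be a separate step, paid only on the join output.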
Suppose you have two RDDs, rdd1 and rdd2, and you want to apply a join operation. If the RDDs already have a partitioner set, then calling rdd3 = rdd1.join(rdd2) produces an rdd3 partitioned by rdd1's partitioner. rdd3 takes its hash partitioner from rdd1 (the first parent, the one join was called on); more precisely, when both parents have a partitioner, Spark reuses the one with the larger number of partitions.
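This can be checked directly by inspecting the partitioner of the join result (a sketch, assuming a SparkContext `sc`):

```scala
import org.apache.spark.HashPartitioner

val left  = sc.parallelize(Seq("a" -> 1, "b" -> 2)).partitionBy(new HashPartitioner(4))
val right = sc.parallelize(Seq("a" -> "x", "c" -> "y"))

val rdd3 = left.join(right)
println(rdd3.partitioner)       // Some(HashPartitioner) with 4 partitions, reused from `left`
println(rdd3.partitions.length) // 4
```

Because `left` already has a partitioner and `right` does not, the join reuses `left`'s partitioner and `left` is not shuffled; only `right` moves.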