Spark join - (edges and vertices)
I have a vertexRDD with 2 columns:
(vertexId, uniqueVertexId)
(V1, 1L)
(V2, 2L)
(V3, 3L)
(V4, 4L)
I also have an edgeRDD:
(srcId, destId)
(V1, V2)
(V2, V3)
(V1, V4)
How can I join them in Spark so that the edges RDD looks like this?
(srcId, destId, uniqueSrcId, uniqueDestId)
(V1, V2, 1L, 2L)
(V2, V3, 2L, 3L)
(V1, V4, 1L, 4L)
I tried different joins but couldn't get the expected output. Any help is appreciated.
I will use Java, but it should be straightforward to convert to Scala.
Assuming edgeRDD has type JavaPairRDD<String,String> (keyed by srcId) and vertexRDD has type JavaPairRDD<String,Long>, edgeRDD.join(vertexRDD) will yield a JavaPairRDD<String,Tuple2<String,Long>> with the following content (let's call it join1):
(V1, Tuple2(V2,1L))
(V2, Tuple2(V3,2L))
(V1, Tuple2(V4,1L))
Then convert join1 into another JavaPairRDD<String,Tuple2<String,Long>> by swapping the key (srcId) with the destId held in the value. In the Java API this is done with mapToPair rather than map, since the result must stay a pair RDD (let's call it join2):
(V2, Tuple2(V1,1L))
(V3, Tuple2(V2,2L))
(V4, Tuple2(V1,1L))
Finally, perform vertexRDD.join(join2) to get a JavaPairRDD<String,Tuple2<Long,Tuple2<String,Long>>> with contents:
(V2, Tuple2(2L, Tuple2(V1,1L)))
(V3, Tuple2(3L, Tuple2(V2,2L)))
(V4, Tuple2(4L, Tuple2(V1,1L)))
You can pass this result through map to create a JavaRDD<String> (or through mapToPair for a new JavaPairRDD) by combining the keys and values appropriately. I will leave the mapping phases up to you.
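In case it helps, the steps above can be put together roughly as follows. This is only a minimal sketch: the class name EdgeVertexJoin, the helper method enrichEdges, and the local[*] master are assumptions made for illustration, and the output is formatted as plain strings to match the rows in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hypothetical class name, used only for this illustration.
public class EdgeVertexJoin {

    // Hypothetical helper: returns one "(srcId, destId, uniqueSrcId, uniqueDestId)" string per edge.
    static JavaRDD<String> enrichEdges(JavaPairRDD<String, String> edgeRDD,
                                       JavaPairRDD<String, Long> vertexRDD) {
        // Join on srcId: (srcId, (destId, uniqueSrcId))
        JavaPairRDD<String, Tuple2<String, Long>> join1 = edgeRDD.join(vertexRDD);

        // Re-key by destId: (destId, (srcId, uniqueSrcId))
        JavaPairRDD<String, Tuple2<String, Long>> join2 = join1.mapToPair(
                t -> new Tuple2<>(t._2()._1(), new Tuple2<>(t._1(), t._2()._2())));

        // Join on destId: (destId, (uniqueDestId, (srcId, uniqueSrcId)))
        JavaPairRDD<String, Tuple2<Long, Tuple2<String, Long>>> join3 = vertexRDD.join(join2);

        // Flatten into the layout asked for in the question.
        return join3.map(t -> "(" + t._2()._2()._1() + ", " + t._1() + ", "
                + t._2()._2()._2() + "L, " + t._2()._1() + "L)");
    }

    public static void main(String[] args) {
        // local[*] is an assumption so the sketch can run without a cluster.
        SparkConf conf = new SparkConf().setAppName("EdgeVertexJoin").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            List<Tuple2<String, Long>> vertices = Arrays.asList(
                    new Tuple2<>("V1", 1L), new Tuple2<>("V2", 2L),
                    new Tuple2<>("V3", 3L), new Tuple2<>("V4", 4L));
            List<Tuple2<String, String>> edges = Arrays.asList(
                    new Tuple2<>("V1", "V2"), new Tuple2<>("V2", "V3"),
                    new Tuple2<>("V1", "V4"));

            JavaPairRDD<String, Long> vertexRDD = sc.parallelizePairs(vertices);
            JavaPairRDD<String, String> edgeRDD = sc.parallelizePairs(edges);

            // Copy and sort the collected rows so the printed order is stable.
            List<String> rows = new ArrayList<>(enrichEdges(edgeRDD, vertexRDD).collect());
            Collections.sort(rows);
            rows.forEach(System.out::println);
        } finally {
            sc.stop();
        }
    }
}
```

Note that both joins are inner joins, so an edge whose srcId or destId has no entry in vertexRDD is silently dropped. The row order of a join result is also not guaranteed, which is why the sketch sorts before printing.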