
Disjoint sets on Apache Spark

I am trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark. The problem is the amount of data: even the raw representation of the graph vertices doesn't fit into RAM on a single machine, and neither do the edges.

The source data is a text file of graph edges on HDFS: "id1 \t id2".

The ids are string values, not ints.

The naive solution that I found is (a rough code sketch follows the list):

  1. take an RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
  2. group the edges by key -> [id1:[id2;id3]] [id3:[id4]]
  3. for each group, assign the minimum id to every member -> (flatMap) [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
  4. reverse the RDD from step 3: [id2:id1] -> [id1:id2]
  5. leftOuterJoin the RDDs from steps 3 and 4
  6. repeat from step 2 until the size of the RDD from step 3 stops changing
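Below is a minimal sketch of this idea with plain RDDs, written as a simplified "propagate the minimum id to the neighbours" loop rather than the exact join sequence above; the HDFS path, the object name and the convergence test are illustrative assumptions, and the sketch mostly serves to show how many shuffles each round costs:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NaiveConnectedComponents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("naive-cc").getOrCreate()
    val sc = spark.sparkContext

    // Edges as (id1, id2) string pairs parsed from "id1 \t id2" lines on HDFS.
    val edges: RDD[(String, String)] = sc
      .textFile("hdfs:///data/edges.tsv")   // assumed input path
      .map(_.split("\t"))
      .map(a => (a(0).trim, a(1).trim))

    // Treat the graph as undirected: keep both directions of every edge.
    val undirected = edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }.cache()

    // Start with every vertex labelled by itself.
    var labels: RDD[(String, String)] = undirected.keys.distinct().map(v => (v, v))

    var changed = true
    while (changed) {
      // Every vertex sends its current label to all of its neighbours (shuffle #1) ...
      val candidates = undirected
        .join(labels)                                   // (src, (dst, labelOfSrc))
        .map { case (_, (dst, label)) => (dst, label) }

      // ... and every vertex keeps the smallest label it has seen so far (shuffle #2).
      val newLabels = labels
        .union(candidates)
        .reduceByKey((a, b) => if (a < b) a else b)
        .cache()

      // Stop once no vertex changed its label in this round (shuffle #3).
      changed = labels.join(newLabels)
        .filter { case (_, (oldLabel, newLabel)) => oldLabel != newLabel }
        .take(1).nonEmpty

      labels = newLabels
    }

    // `labels` now maps every vertex id to the minimum id of its component.
    labels.take(20).foreach(println)
    spark.stop()
  }
}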

But this results in the transfer of large amounts of data between nodes (shuffling).

Any advice?

If you are working with graphs I would suggest that you take a look at either one of these libraries:

They both provide the connected components algorithm out of the box.

GraphX:

val graph: Graph = ...
val cc = graph.connectedComponents().vertices
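One thing worth noting for the question above: GraphX vertex ids must be Longs, while the ids here are strings. A minimal sketch, assuming an existing SparkContext and that hashing the string ids to Longs is acceptable (the path and the hash choice are assumptions, and a 32-bit hashCode may collide on very large graphs):

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

def connectedComponentsGraphX(sc: SparkContext, path: String) = {
  // Parse "id1 \t id2" lines into string pairs.
  val rawEdges = sc.textFile(path)
    .map(_.split("\t"))
    .map(a => (a(0).trim, a(1).trim))

  // GraphX vertex ids must be Longs, so derive one from each string id.
  def toVertexId(s: String): VertexId = s.hashCode.toLong

  val vertices = rawEdges
    .flatMap { case (a, b) => Seq(a, b) }
    .distinct()
    .map(s => (toVertexId(s), s))        // keep the original string as the vertex attribute

  val edges = rawEdges.map { case (a, b) => Edge(toVertexId(a), toVertexId(b), ()) }

  val graph = Graph(vertices, edges)

  // (hashedId, componentId) pairs; join back on `vertices` to recover the string ids.
  graph.connectedComponents().vertices
}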

GraphFrames:

val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
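If you go the GraphFrames route with the data described in the question, the string ids can be used directly, but note that recent GraphFrames versions require a Spark checkpoint directory for connectedComponents. A minimal end-to-end sketch, with the paths as assumptions:

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("gf-cc").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/gf-checkpoints")   // assumed location, required by connectedComponents

// Read "id1 \t id2" lines into an edge DataFrame with the column names GraphFrames expects.
val edges = spark.read
  .option("sep", "\t")
  .csv("hdfs:///data/edges.tsv")        // assumed input path
  .toDF("src", "dst")

val graph = GraphFrame.fromEdges(edges)  // vertices are inferred from the edges
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()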
