Disjoint sets on Apache Spark
I am trying to find an algorithm for computing disjoint sets (connected components / union-find) over a large amount of data with Apache Spark. The problem is the volume of data. Even the raw representation of the graph's vertices does not fit into RAM on a single machine. The edges do not fit into RAM either.
The source data is a text file of graph edges on HDFS: "id1 \t id2". The ids are string values, not ints.
The naive solution that I found is:

1. Create a pair RDD of edges: `[id1:id2] [id3:id4] [id1:id3]`
2. Group the pairs by key: `[id1:[id2;id3]] [id3:[id4]]`
3. For each group, assign the minimum id of the group to every member (flatMap): `[id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]`
4. Reverse the RDD from stage 3: `[id2:id1] -> [id1:id2]`
5. `leftOuterJoin` of the RDDs from stages 3 and 4, and repeat until the labels stop changing.

But this results in the transfer of large amounts of data between nodes (shuffling).
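The steps above can be sketched with plain Spark RDD operations. This is a hedged illustration, not the questioner's actual code: the file path and the names `edges` and `relabel` are my own assumptions, and only one relabeling round is shown.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

// Assumes an existing SparkContext `sc` and an HDFS file of
// tab-separated edges, as described in the question.
def loadEdges(sc: SparkContext): RDD[(String, String)] =
  sc.textFile("hdfs:///graph/edges.tsv").map { line =>
    val Array(a, b) = line.split("\t")
    (a, b)
  }

// Stages 2-3: group by key, then label every id in the group
// (including the key itself) with the smallest id seen.
def relabel(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs.groupByKey().flatMap { case (key, values) =>
    val minId = (values ++ Iterator(key)).min
    (Iterator(key) ++ values.iterator).map(id => (id, minId))
  }
```

Both `groupByKey` and the subsequent `leftOuterJoin` repartition their inputs by key on every iteration, which is exactly the shuffle cost the question complains about.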
Any advice?
If you are working with graphs I would suggest that you take a look at one of these libraries: GraphX or GraphFrames. They both provide the connected components algorithm out of the box.
GraphX:
val graph: Graph = ...
val cc = graph.connectedComponents().vertices
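One caveat worth noting for this input: GraphX vertex ids must be `Long`s, while the question's ids are strings, so the strings need to be mapped to numeric ids first. A minimal sketch, assuming the tab-separated edge file from the question (the path is illustrative, and `hashCode` is only used here for brevity):

```scala
import org.apache.spark.graphx.Graph

// Hash each string id to a Long. hashCode is 32-bit and can collide
// on very large id sets; a stable 64-bit hash is safer in practice.
val rawEdges = sc.textFile("hdfs:///graph/edges.tsv").map { line =>
  val Array(a, b) = line.split("\t")
  (a.hashCode.toLong, b.hashCode.toLong)
}

val graph = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)
val cc = graph.connectedComponents().vertices // (vertexId, componentId)
```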
GraphFrames:
val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
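GraphFrames works directly with string ids, which fits the "id1 \t id2" input without any hashing. Its connected components algorithm does require a Spark checkpoint directory to be set before running. A hedged end-to-end sketch, with illustrative paths and names:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("cc").getOrCreate()
// Required by GraphFrames' connected components implementation.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

val edgesDF = spark.read
  .option("delimiter", "\t")
  .csv("hdfs:///graph/edges.tsv")
  .toDF("src", "dst")

// Derive the vertex list from the edges (column must be named "id").
val verticesDF = edgesDF.select("src").union(edgesDF.select("dst"))
  .distinct().toDF("id")

val graph = GraphFrame(verticesDF, edgesDF)
val cc = graph.connectedComponents.run()
```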