
Disjoint sets on Apache Spark

I am trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark. The problem is the amount of data: even the raw representation of the graph vertices doesn't fit into RAM on a single machine, and neither do the edges.

The source data is a text file of graph edges on HDFS: "id1 \t id2".

The ids are string values, not ints.

The naive solution that I found is (a rough code sketch follows the list):

  1. take an RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
  2. group the edges by key -> [id1:[id2;id3]] [id3:[id4]]
  3. for each group, assign the minimum id to every member -> (flatMap) [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
  4. reverse the RDD from step 3: [id2:id1] -> [id1:id2]
  5. leftOuterJoin the RDDs from steps 3 and 4
  6. repeat from step 2 until the size of the RDD from step 3 stops changing
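Below is a minimal sketch of this idea with plain RDDs, written as a simplified "propagate the minimum id to the neighbours" loop rather than the exact join sequence above; the HDFS path, the object name and the convergence test are illustrative assumptions, and the sketch mostly serves to show how many shuffles each round costs:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NaiveConnectedComponents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("naive-cc").getOrCreate()
    val sc = spark.sparkContext

    // Edges as (id1, id2) string pairs parsed from "id1 \t id2" lines on HDFS.
    val edges: RDD[(String, String)] = sc
      .textFile("hdfs:///data/edges.tsv")   // assumed input path
      .map(_.split("\t"))
      .map(a => (a(0).trim, a(1).trim))

    // Treat the graph as undirected: keep both directions of every edge.
    val undirected = edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }.cache()

    // Start with every vertex labelled by itself.
    var labels: RDD[(String, String)] = undirected.keys.distinct().map(v => (v, v))

    var changed = true
    while (changed) {
      // Every vertex sends its current label to all of its neighbours (shuffle #1) ...
      val candidates = undirected
        .join(labels)                                   // (src, (dst, labelOfSrc))
        .map { case (_, (dst, label)) => (dst, label) }

      // ... and every vertex keeps the smallest label it has seen so far (shuffle #2).
      val newLabels = labels
        .union(candidates)
        .reduceByKey((a, b) => if (a < b) a else b)
        .cache()

      // Stop once no vertex changed its label in this round (shuffle #3).
      changed = labels.join(newLabels)
        .filter { case (_, (oldLabel, newLabel)) => oldLabel != newLabel }
        .take(1).nonEmpty

      labels = newLabels
    }

    // `labels` now maps every vertex id to the minimum id of its component.
    labels.take(20).foreach(println)
    spark.stop()
  }
}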

But this results in the transfer of large amounts of data between nodes (shuffling).

Any advice?

If you are working with graphs I would suggest that you take a look at either one of these libraries:

They both provide the connected components algorithm out of the box.

GraphX:

val graph: Graph = ...
val cc = graph.connectedComponents().vertices
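One thing worth noting for the question above: GraphX vertex ids must be Longs, while the ids here are strings. A minimal sketch, assuming an existing SparkContext and that hashing the string ids to Longs is acceptable (the path and the hash choice are assumptions, and a 32-bit hashCode may collide on very large graphs):

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

def connectedComponentsGraphX(sc: SparkContext, path: String) = {
  // Parse "id1 \t id2" lines into string pairs.
  val rawEdges = sc.textFile(path)
    .map(_.split("\t"))
    .map(a => (a(0).trim, a(1).trim))

  // GraphX vertex ids must be Longs, so derive one from each string id.
  def toVertexId(s: String): VertexId = s.hashCode.toLong

  val vertices = rawEdges
    .flatMap { case (a, b) => Seq(a, b) }
    .distinct()
    .map(s => (toVertexId(s), s))        // keep the original string as the vertex attribute

  val edges = rawEdges.map { case (a, b) => Edge(toVertexId(a), toVertexId(b), ()) }

  val graph = Graph(vertices, edges)

  // (hashedId, componentId) pairs; join back on `vertices` to recover the string ids.
  graph.connectedComponents().vertices
}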

GraphFrames:

val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
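If you go the GraphFrames route with the data described in the question, the string ids can be used directly, but note that recent GraphFrames versions require a Spark checkpoint directory for connectedComponents. A minimal end-to-end sketch, with the paths as assumptions:

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("gf-cc").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/gf-checkpoints")   // assumed location, required by connectedComponents

// Read "id1 \t id2" lines into an edge DataFrame with the column names GraphFrames expects.
val edges = spark.read
  .option("sep", "\t")
  .csv("hdfs:///data/edges.tsv")        // assumed input path
  .toDF("src", "dst")

val graph = GraphFrame.fromEdges(edges)  // vertices are inferred from the edges
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()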
