
Remove Vertices with no outgoing edges in GraphX

I have a big graph (a few million vertices and edges). I want to remove all the vertices that have no outgoing edges, along with the edges attached to them. I have some code that works, but it is slow and I need to run it several times. I am sure some existing GraphX method could make it much faster.

This is the code I have.

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val users: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "1"), (2L, "2"), (3L, "3"), (4L, "4")))
val relationships: RDD[Edge[Double]] = sc.parallelize(
  Array(
    Edge(1L, 3L, 500.0),
    Edge(3L, 2L, 400.0),
    Edge(2L, 1L, 600.0),
    Edge(3L, 1L, 200.0),
    Edge(2L, 4L, 200.0),
    Edge(3L, 4L, 500.0)
  ))

val graph = Graph(users, relationships)

// Collect the IDs of all vertices that have at least one outgoing edge.
val lst = graph.outDegrees.map(x => x._1).collect
val set: scala.collection.mutable.HashSet[Long] = new scala.collection.mutable.HashSet()
for (a <- lst) { set.add(a) }
// Keep only those vertices; subgraph also drops edges that touch a removed vertex.
val subg = graph.subgraph(vpred = (id, attr) => set.contains(id))
// Vertex 4 has no outgoing edges, so subg.edges.count should be 4 and subg.vertices.count should be 3.

I don't know how else this can be achieved. Any help is appreciated!

EDIT: I could do it with a HashSet, but I think it can still be improved.

A first optimization to your code is to make lst a set rather than an array, which makes each lookup O(1) rather than O(n).

But this is not scalable, since you are collecting everything on the driver and then sending it back to the executors. The right way would be to join the graph's vertices with outDegrees (joinVertices, or outerJoinVertices so that vertices missing from outDegrees can be given a default) and filter on the result, so that everything stays distributed.
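The join-based idea could look roughly like the sketch below. It is not taken from the answer itself: it assumes `outerJoinVertices` followed by `subgraph` is an acceptable way to express the join, it rebuilds the sample graph from the question, and the names `withDegree` and `pruned` as well as the local-mode `SparkConf` settings are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode for experimentation; in spark-shell this returns the existing context.
val conf = new SparkConf().setAppName("prune-sinks").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Sample graph from the question.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "1"), (2L, "2"), (3L, "3"), (4L, "4")))
val relationships: RDD[Edge[Double]] = sc.parallelize(Seq(
  Edge(1L, 3L, 500.0), Edge(3L, 2L, 400.0), Edge(2L, 1L, 600.0),
  Edge(3L, 1L, 200.0), Edge(2L, 4L, 200.0), Edge(3L, 4L, 500.0)))
val graph = Graph(users, relationships)

// Attach the out-degree to every vertex. Vertices that do not appear in
// outDegrees have no outgoing edges, so they get 0.
val withDegree = graph.outerJoinVertices(graph.outDegrees) {
  (_, attr, deg) => (attr, deg.getOrElse(0))
}

// Keep vertices with at least one outgoing edge (subgraph also drops the
// edges that touch a removed vertex), then restore the original attributes.
val pruned = withDegree
  .subgraph(vpred = (_, v) => v._2 > 0)
  .mapVertices((_, v) => v._1)
```

Nothing is collected to the driver here, so the filter runs on the executors. Note that this removes sinks once; if removing vertex 4 left another vertex without outgoing edges, the same steps would have to be applied again.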

You could directly define another graph with the filtered vertices. Something like this:

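val lst = graph.outDegrees.map(x => x._1).collect.toSet
// Filter the edges too: Graph(...) re-creates any vertex that a remaining edge still references.
val graph2 = Graph(graph.vertices.filter(v => lst.contains(v._1)),
  graph.edges.filter(e => lst.contains(e.srcId) && lst.contains(e.dstId)))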

If you do not want to use subgraph, here is another way that uses triplets to find the destination vertices that are also source vertices.

val graph = org.apache.spark.graphx.Graph(users, relationships)
val AsSubjects = graph.triplets.map(triplet => (triplet.srcId,(triplet)))
val AsObjects = graph.triplets.map(triplet => (triplet.dstId,(triplet)))
val ObjectsJoinSubjects = AsObjects.join(AsSubjects)
val ObjectsJoinSubjectsDistinct = ObjectsJoinSubjects.mapValues(x => x._1).distinct()
val NewVertices = ObjectsJoinSubjectsDistinct.map(x => (x._2.srcId, x._2.srcAttr)).distinct()
val NewEdges = ObjectsJoinSubjectsDistinct.map(x => new Edge(x._2.srcId, x._2.dstId, x._2.attr))
val newgraph = Graph(NewVertices,NewEdges)

I am not sure whether this improves on subgraph, because my solution uses distinct(), which is expensive. I tested with the graph you provided and my solution actually takes longer. However, that is a small example, so I would suggest testing with a larger graph and letting us know whether this works better.

You could use this to find all the vertices with zero out-degree. Changing the predicate to deg > 0 keeps only the vertices that do have outgoing edges.

val zeroOutDeg = graph.filter(
  graph => {
    val degrees: VertexRDD[Int] = graph.outDegrees
    // Vertices absent from outDegrees have no outgoing edges, so default them to 0.
    graph.outerJoinVertices(degrees) { (vid, data, deg) => deg.getOrElse(0) }
  },
  vpred = (vid: VertexId, deg: Int) => deg == 0)
