
GraphFrame shortest path fails with Vertex Id not found error

My Spark/Databricks job is using GraphFrames to find shortest paths between vertices inside a connected component. The algorithm fails after several minutes with org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID 1 not contained in GraphFrame

The error message makes no sense to me: the graph's vertices do not contain id == 1, and neither do the src or dst columns of the edges. The algorithm should not be looking for such an id at all. I'm wondering whether some size-driven limitation is causing shortestPaths to fail, or whether I'm missing some other part of the definition.

The code is very simple:

val sp14754224 = g54.shortestPaths.landmarks("14754224").run

The graph structure is quite basic too:

e54:org.apache.spark.sql.DataFrame
    src:integer
    dst:integer
    edgeRevenue:double
    edgeAgreements:double

v54:org.apache.spark.sql.DataFrame
    id:integer
    Name:string
    vertexRevenue:double
    vertexDealss:long

The graph itself is relatively large (31,342 vertices and 1,027,724 edges), but it's just a subset of a larger graph previously processed by connectedComponents. There also seem to be no issues with memory consumption (the observed peak was ~20GB while each worker has 64GB).

Any recommendations?

I believe landmarks is supposed to be a sequence. Try this:

val sp14754224 = g54.shortestPaths.landmarks(Seq("14754224")).run

I wonder if some implicit conversion is going on, so that your string is perhaps becoming a Seq[Char], hence the vertex 1 error.
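To illustrate the suspected conversion, here is a minimal plain-Scala sketch that needs no Spark. The landmarks function below is a hypothetical stand-in, assuming the real method is declared to accept Seq[Any]; it is not the GraphFrames implementation.

```scala
object LandmarkDemo {
  // Hypothetical stand-in (assumption): any method whose parameter is Seq[Any].
  def landmarks(value: Seq[Any]): Seq[Any] = value

  def main(args: Array[String]): Unit = {
    // A bare String is implicitly wrapped into a sequence of its characters.
    val fromString = landmarks("14754224")
    // An explicit Seq keeps the whole id as a single element.
    val fromSeq = landmarks(Seq("14754224"))

    println(fromString.length) // 8
    println(fromString.head)   // 1
    println(fromSeq.length)    // 1
  }
}
```

If that is what happens here, the first landmark the algorithm looks up would be the character 1, which would match the error message. Since the id column in your schema is an integer, passing Seq(14754224) may also be worth trying, though I have not verified that it behaves differently.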

I never found a solution in Spark/Scala, but there is an easy workaround using SparkR:

  • Switch to R (either create an R notebook or use %r; either way, the input data has to be read from storage)
  • Install the iGraph and SparkR libraries on the cluster
  • Collect the vertices and edges to get R data.frames
  • Apply iGraph methods (e.g. shortest path)

The graph data must fit in the driver's memory, which was the case for the task in my question.

Code example:

%r
if (require(SparkR) == FALSE) install.packages("SparkR")
if (require(igraph) == FALSE) install.packages("igraph")

processRoot <-  "abfss://yourAccount@fdestorageuat.dfs.core.windows.net/YourDataPath/"

#Data Intake
inEdgesPath     <- paste(processRoot, "STG_Edges/", sep = "")
inVerticesPath  <- paste(processRoot, "STG_Vertices/", sep = "")

inEdges     <- collect(read.parquet(inEdgesPath))
inVertices  <- collect(read.parquet(inVerticesPath))

#Graph Definition (inVertices is optional - added to capture vertex names and sizes)
g <- graph_from_data_frame(inEdges, directed = FALSE, vertices = inVertices)

#Example of iGraph (defining connected components and communities)
clu <- components(g)
fg <- fastgreedy.community(g)

#Outputs from both commands have the same order and length as the vertices, so they can be added to the vertices data
inVertices$componentId <- clu[["membership"]]
inVertices$communityId <- as.numeric(membership(fg))

iGraph has much broader functionality than GraphX/GraphFrames. It also performs much faster as long as the graph data fits in memory. If the graph is too large, consider running GraphFrames' connected components first and then processing each component separately, calling iGraph's functionality through gapply grouped by component id.
