GraphFrame shortest paths fails with a "vertex ID not found" error

Question

My Spark/Databricks job uses GraphFrames to find the shortest paths between vertices within a connected component. After a few minutes the algorithm fails with org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID 1 not contained in GraphFrame.

The error message looks completely unrelated: the vertices contain no id == 1, and neither do the src or dst columns of the edges. The algorithm should not be looking for such an id at all. I wonder whether some size-driven limit makes shortestPaths fail, or whether I am missing some other part of the definition.

The code is simple:

val sp14754224 = g54.shortestPaths.landmarks("14754224").run

The graph structure is also very basic:

e54:org.apache.spark.sql.DataFrame
    src:integer
    dst:integer
    edgeRevenue:double
    edgeAgreements:double

v54:org.apache.spark.sql.DataFrame
    id:integer
    Name:string
    vertexRevenue:double
    vertexDealss:long

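The graph itself is fairly large (31,342 vertices and 1,027,724 edges), but it is only a subset of a larger graph that was previously processed by connectedComponents. Memory consumption does not seem to be a problem either (the observed peak was about 20 GB, while each worker has 64 GB).

Any recommendations?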

Answer 1

I believe the landmarks should be a sequence. Try this:

val sp14754224 = g54.shortestPaths.landmarks(Seq("14754224")).run

I wonder whether some conversion is taking place, so that your string is turned into a Seq[Char], which would explain the vertex 1 error.
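To illustrate the suspected conversion, here is a minimal plain-Scala sketch that needs no Spark. It assumes that landmarks takes a Seq[Any], which is my understanding of the GraphFrames Scala API; landmarksSeen is a hypothetical stand-in with that parameter type, not a GraphFrames function.

```scala
object LandmarkDemo {
  // Hypothetical stand-in for a method that, like landmarks, is assumed
  // to accept Seq[Any]. It simply returns what it received.
  def landmarksSeen(value: Seq[Any]): Seq[Any] = value

  def main(args: Array[String]): Unit = {
    // A bare String compiles here because Scala implicitly wraps it
    // as a sequence of its characters.
    val fromString = landmarksSeen("14754224")
    println(fromString.head)   // prints 1 (the Char '1')
    println(fromString.length) // prints 8

    // An explicit Seq is passed through as a single landmark.
    val fromSeq = landmarksSeen(Seq(14754224))
    println(fromSeq.length)    // prints 1
  }
}
```

If this is what happens inside landmarks, the first landmark the algorithm looks up is the character 1, which would match the "Vertex ID 1" in the exception. Since the schema in the question declares id as integer, a landmark of the same type, Seq(14754224), may also be worth trying; that is an assumption on my part and not something confirmed in this thread.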

Answer 2

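I never found a solution in Spark/Scala, but the problem was easy to solve with SparkR:

Switch to R (create an R notebook or use %r; either way, the input data must be read from storage)
Install the iGraph and SparkR libraries on the cluster
Collect the vertices and edges into R data.frames
Apply iGraph methods (for example, shortest paths)

The graph data must fit into the driver's memory, which was the case for the task in my question.

Code example: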

%r
# Install each package only if it is missing, then load it
if (!require(SparkR)) { install.packages("SparkR"); library(SparkR) }
if (!require(igraph)) { install.packages("igraph"); library(igraph) }

processRoot <-  "abfss://yourAccount@fdestorageuat.dfs.core.windows.net/YourDataPath/"

#Data Intake
inEdgesPath     <- paste(processRoot, "STG_Edges/", sep = "")
inVerticesPath  <- paste(processRoot, "STG_Vertices/", sep = "")

inEdges     <- collect(read.parquet(inEdgesPath))
inVertices  <- collect(read.parquet(inVerticesPath))

#Graph Definition (inVertices is optional - added to capture names and sizes of vertices)
g <- graph_from_data_frame(inEdges, directed = FALSE, vertices = inVertices)

#Example of iGraph (defining connected components and communities)
clu <- components(g)
fg <- fastgreedy.community(g)

#Outputs from both commands have the same order and length as the vertices, so they can be added to the vertex data
inVertices$componentId <- clu[["membership"]]
inVertices$communityId <- as.numeric(membership(fg))

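iGraph offers a broader range of functionality than GraphX/GraphFrames. As long as the graph data fits into memory, it also runs much faster. If the graph is too large, consider running GraphFrames' connected components first and then processing each component separately, calling gapply per component id.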
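For anyone who wants to stay in Scala, the component-by-component idea can be roughed out as follows. This is only a sketch under assumptions: it swaps SparkR's gapply for a plain filter on the component column, subgraphFor is a hypothetical helper, the toy graph and the checkpoint path are made up, and the landmark is passed as a Seq whose element type matches the integer id column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

object PerComponentSketch {
  // Hypothetical helper: returns the subgraph formed by the connected
  // component that contains vertexId.
  def subgraphFor(g: GraphFrame, vertexId: Any): GraphFrame = {
    // connectedComponents adds a "component" column to every vertex.
    val labelled = g.connectedComponents.run()
    val target = labelled.filter(col("id") === vertexId)
      .select("component").head().get(0)
    val subVertices = labelled.filter(col("component") === target)
    // An edge lies inside the component exactly when its src does,
    // so a semi-join on src is enough.
    val subEdges = g.edges.join(
      subVertices, g.edges("src") === subVertices("id"), "left_semi")
    GraphFrame(subVertices, subEdges)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("per-component-sketch").getOrCreate()
    import spark.implicits._
    // The default connectedComponents algorithm needs a checkpoint
    // directory; this path is a placeholder.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    // Made-up toy graph with two components: {1, 2, 3} and {4, 5}.
    val vertices = Seq(1, 2, 3, 4, 5).toDF("id")
    val edges = Seq((1, 2), (2, 3), (4, 5)).toDF("src", "dst")
    val sub = subgraphFor(GraphFrame(vertices, edges), 1)

    // Landmarks given as a Seq of Int, matching the type of the id column.
    sub.shortestPaths.landmarks(Seq(3)).run().show(false)
    spark.stop()
  }
}
```

On a real graph, the labelled vertices should probably be cached or written out once, since this sketch filters them twice.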
