StackOverflowError when doing iterative computing using Apache-Spark
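If an RDD object has non-empty .dependencies, does that mean it has lineage? How can I remove it?
I am doing iterative computation in which each iteration depends on the result computed in the previous iteration. After several iterations it throws a StackOverflowError.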
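At first I tried to use cache. I read the code in pregel.scala, which is part of GraphX; it uses the count method to materialize the object after cache. But when I attached a debugger, it seemed that this approach does not empty .dependencies, and it does not work in my code either.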
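To make the observation about .dependencies concrete, here is a minimal sketch that measures how deep the lineage is after a few cached and counted steps. The names `LineageDepthSketch`, `lineageDepth`, and `chain` are hypothetical helpers written for this illustration, not Spark APIs, and the example uses plain RDDs in local mode rather than the Graph objects from my code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageDepthSketch {
  // Hypothetical helper (not a Spark API): length of the longest dependency chain.
  // It recurses once per lineage level, so use it only on shallow lineages.
  def lineageDepth(rdd: RDD[_]): Int =
    if (rdd.dependencies.isEmpty) 1
    else 1 + rdd.dependencies.map(dep => lineageDepth(dep.rdd)).max

  // Chains `steps` map stages, caching and materializing each one,
  // and returns the last RDD in the chain.
  def chain(sc: SparkContext, steps: Int): RDD[Int] = {
    var current: RDD[Int] = sc.parallelize(1 to 10)
    for (_ <- 1 to steps) {
      val next = current.map(_ + 1).cache()
      next.count() // materializes the cache; the parent link is kept
      current = next
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-depth-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = chain(sc, 20)
      println(s"lineage depth after 20 cached steps: ${lineageDepth(rdd)}")
      println(rdd.toDebugString)
    } finally sc.stop()
  }
}
```

The depth grows by one per step even though every step is cached and counted, which matches what the debugger shows: cache stores the computed partitions but keeps the chain of parents.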
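Another alternative is to use checkpoint. For a Graph object, I tried calling checkpoint on the vertices and edges and then materializing them by calling count on the vertices and edges. I then used .isCheckpointed to check whether the checkpoint was created correctly, but it always returns false.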
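For comparison, this is a minimal sketch of the checkpoint pattern on a plain RDD, where it behaves as documented: the checkpoint directory is set first, checkpoint is called before any job has run on the RDD, and an action then writes the checkpoint. The object name `CheckpointSketch`, the `iterate` helper, and the interval of 10 are assumptions made for illustration; whether the same ordering also fixes Graph objects on a given Spark version is something I have not confirmed here.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  // Runs `iterations` map steps and checkpoints every `interval` steps.
  // Assumes sc.setCheckpointDir has already been called.
  def iterate(sc: SparkContext, iterations: Int, interval: Int): RDD[Long] = {
    var current: RDD[Long] = sc.parallelize(0L until 100L)
    for (i <- 1 to iterations) {
      val next = current.map(_ + 1L).cache()
      // Mark for checkpointing before the first action on `next`.
      if (i % interval == 0) next.checkpoint()
      // The action fills the cache and, if marked, writes the checkpoint.
      next.count()
      current.unpersist(blocking = false)
      current = next
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A temporary local directory; on a cluster this would be a shared path.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint-sketch").toString)
      val result = iterate(sc, iterations = 50, interval = 10)
      println(s"isCheckpointed: ${result.isCheckpointed}")
      println(result.toDebugString)
    } finally sc.stop()
  }
}
```

Caching before checkpointing avoids computing the RDD a second time when the checkpoint is written. Checkpointing every iteration would also work but adds a write per step, so the interval is a trade-off.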
Update: I wrote a simplified version of the code that reproduces the problem.
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("HDTM")
    .setMaster("local[4]")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "edu.nd.dsg.hdtm.util.HDTMKryoRegistrator")
  val sc = new SparkContext(conf)
  val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
  val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L)))
  val newGraph = Graph(v, e)
  var currentGraph = newGraph
  val vertexIds = currentGraph.vertices.map(_._1).collect()
  for (i <- 1 to 1000) {
    var g = currentGraph
    vertexIds.toStream.foreach(id => {
      g = Graph(currentGraph.vertices, currentGraph.edges)
      g.cache()
      g.edges.cache()
      g.vertices.cache()
      g.vertices.count()
      g.edges.count()
    })
    currentGraph.unpersistVertices(blocking = false)
    currentGraph.edges.unpersist(blocking = false)
    currentGraph = g
    println(" iter "+i+" finished")
  }
}
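Update
Here is the code. I removed most of the unnecessary methods to keep the number of lines to a minimum, so it may not make sense if you consider what it is supposed to do.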
object StackOverFlow {
  final val PATH = "./"
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("HDTM")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "edu.nd.dsg.hdtm.util.HDTMKryoRegistrator")
    val sc = new SparkContext(conf)
    val filePath = PATH + "src/test/resources/binary.txt"
    val wikiGraph: Graph[WikiDataVertex, Double] = WikiGraphLoader.loadGraphFromTestHDTMFile(sc, filePath)
    wikiGraph.cache()
    val root = 0L
    val bfsGraph = GraphAlgorithm.initializeGraph(wikiGraph, root, sc)
    bfsGraph.cache()
    val vertexIds = bfsGraph.vertices.map(_._1).collect()
    var currentGraph = bfsGraph
    for (i <- 1 to 1000) {
      var g = currentGraph
      vertexIds.toStream.foreach(id => {
        g = samplePath(g, id, root)
      })
      currentGraph.unpersistVertices(blocking = false)
      currentGraph.edges.unpersist(blocking = false)
      currentGraph = g
      println(" iter "+i+" finished")
    }
  }
  def samplePath[ED: ClassTag](graph: Graph[WikiDataVertex, ED],
                               instance: VertexId, root: VertexId): Graph[WikiDataVertex, ED] = {
    if(instance == 0L) return graph
    val (removedGraph, remainedGraph) = splitGraph(graph, instance)
    /**
     * Here I omit some other code, which will change the attributes of removedGraph and remainedGraph
     */
    val newVertices = graph.outerJoinVertices(removedGraph.vertices ++ remainedGraph.vertices)({
      (vid, vd, opt) => {
        opt.getOrElse(vd)
      }
    }).vertices
    val newEdges = graph.edges.map(edge => {
      if (edge.dstId == instance)
        edge.copy(srcId = edge.srcId)
        // In the real case edge.srcId will be replaced by a vertexId calculated by other functions
      else
        edge.copy()
    })
    val g = Graph(newVertices, newEdges)
    g.vertices.cache()
    g.edges.cache()
    g.cache()
    g.vertices.count()
    g.edges.count()
    remainedGraph.unpersistVertices(blocking = false)
    remainedGraph.edges.unpersist(blocking = false)
    removedGraph.unpersistVertices(blocking = false)
    removedGraph.edges.unpersist(blocking = false)
    g
  }
  /**
   * Split a graph into two sub-graphs by a vertex `instance`.
   * The edges that end at `instance` will be lost.
   * @param graph Graph that will be separated
   * @param instance Vertex that we are using to separate the graph
   * @tparam ED Edge type
   * @return (sub-graph with `instance`, sub-graph without `instance`)
   **/
  def splitGraph[ED: ClassTag]
  (graph: Graph[WikiDataVertex, ED], instance: VertexId): (Graph[WikiDataVertex, ED], Graph[WikiDataVertex,ED]) = {
    val nGraph = GraphAlgorithm.graphWithOutDegree(graph)
    // This is needed twice, so cache it to prevent re-computation
    nGraph.cache()
    val wGraph = nGraph.subgraph(epred = e => e.dstAttr._1.path.contains(instance) ||
      e.srcAttr._1.path.contains(instance),
      vpred = (id, vd) => vd._1.path.contains(instance))
    val woGraph = nGraph.subgraph(epred = e => !e.dstAttr._1.path.contains(instance) &&
      !e.srcAttr._1.path.contains(instance),
      vpred = (id, vd) => !vd._1.path.contains(instance))
    val removedGraph = Graph(wGraph.vertices.mapValues(_._1), wGraph.edges, null)
    val remainedGraph = Graph(woGraph.vertices.mapValues(_._1), woGraph.edges, null)
    removedGraph.vertices.count()
    removedGraph.edges.count()
    removedGraph.cache()
    remainedGraph.vertices.count()
    remainedGraph.edges.count()
    remainedGraph.cache()
    nGraph.unpersistVertices(blocking = false)
    nGraph.edges.unpersist(blocking = false)
    (removedGraph, remainedGraph)
  }
}
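In the first 10 iterations it runs fast; after that, each iteration takes more time. I checked the Spark WebUI: the actual execution time of each operation is almost the same, but Scheduler Delay increases as the number of iterations grows. After 20 iterations it throws a StackOverflowError.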
val g = loadEdgeFile(sc, edge_pt, n_partition)
g.edges.foreachPartition(_ => ())
g.vertices.foreachPartition(_ => ())
g.checkpoint()
g.edges.foreachPartition(_ => ())
g.vertices.foreachPartition(_ => ())
println(s"is cp: ${g.isCheckpointed}")
For a graph to be checkpointed, it must satisfy three conditions: