New to Spark, mapping with graphx graphs - NullPointerException

My goal is to count triangles in multiple subgraphs of a common full graph. Each subgraph is defined by a constant set of nodes plus one node from an RDD[Long]. I'm new to Spark/GraphX, so this may be an improper use of map. The code I'm sharing will reproduce my error.

To start, I have a subgraph of a full graph, declared as shown below:

import org.apache.spark.rdd._
import org.apache.spark.graphx._
val nodes: RDD[(VertexId, String)] = sc.parallelize(Array((3L, "3"), (7L, "7"), (5L, "5"), (2L, "2"), (4L, "4")))
// These are the graph's edges (the original name "vertices" was misleading)
val edges: RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "a"), Edge(3L, 5L, "b"), Edge(2L, 5L, "c"), Edge(5L, 7L, "d"), Edge(2L, 7L, "e"), Edge(4L, 5L, "f")))
val graph: Graph[String,String] = Graph(nodes, edges, "z")

val baseNodes: Array[Long] = Array(2L,5L,7L)    
val subgraph = graph.subgraph(vpred = (vid,attr)=> baseNodes contains vid)

Then I declare an RDD[Long] of other nodes from the graph.

val testNodes: RDD[Long] = sc.parallelize(Array(3L,4L))

I want to add each testNode to the subgraph and count the triangles present at testNode.

val triangles: RDD[(Long,Int)] = testNodes.map{ newNode =>
  val newNodes: Array[Long] = baseNodes :+ newNode
  val newSubgraph = graph.subgraph(vpred = (vid,attr)=> newNodes contains vid)
  (newNode,findTriangles(7L,newSubgraph))
}
triangles.collect().foreach(println) // forces evaluation; this is where the NullPointerException surfaces

My findTriangles works fine if I call it outside of the map function.

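def findTriangles(id: Long, subgraph: Graph[String,String]): Int = {
  // Per-vertex triangle counts for the whole subgraph
  val triCounts = subgraph.triangleCount().vertices
  // Compare vertex IDs as Long; converting to Int would truncate large IDs
  val count: Int = triCounts.filter{ case (vid, _) => vid == id }.map{ case (_, count) => count }.first
  count
}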
val triangles = findTriangles(7L,subgraph) //1

But when I run my map function to calculate triangles, I get a NullPointerException. I think the problem is in using my graph val inside the mapping function. Is that the issue? Is there a way to work around this?

I think that the issue should be the baseNodes variable. Variables that are declared locally, such as baseNodes in your example, are only visible in the Spark driver, not in the executors that actually execute transformations and actions. To avoid the NullPointerException, you need to parallelize any variable that you'll need in the transformations (like map) that run on the executors. As an alternative, if the variable is read-only, you can broadcast it to the executors using Spark's broadcast construct. In your case, baseNodes doesn't seem to be modified within the map operation, so it's a good candidate to be broadcast instead of parallelized.
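A caveat worth adding: a GraphX Graph is backed by RDDs, and Spark does not allow RDD operations to be invoked from inside another RDD's transformation, so broadcasting baseNodes alone may not remove the exception as long as graph.subgraph is still called inside testNodes.map. Below is a minimal sketch of one possible workaround, not a definitive fix. It broadcasts baseNodes as suggested above, but collects testNodes to the driver and builds each subgraph there. It assumes testNodes is small enough to collect; the names baseNodesBc and trianglesAt, the app name, and the local[*] master are illustrative assumptions, not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses an existing context (e.g. in spark-shell); the local master is an assumption
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triangle-sketch").setMaster("local[*]"))

val nodes: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (3L, "3"), (7L, "7"), (5L, "5"), (2L, "2"), (4L, "4")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "a"), Edge(3L, 5L, "b"), Edge(2L, 5L, "c"),
  Edge(5L, 7L, "d"), Edge(2L, 7L, "e"), Edge(4L, 5L, "f")))

// partitionBy is required by triangleCount on older Spark versions and harmless on newer ones
val graph: Graph[String, String] =
  Graph(nodes, edges, "z").partitionBy(PartitionStrategy.RandomVertexCut).cache()

// Read-only lookup set, broadcast once to the executors
val baseNodesBc = sc.broadcast(Set[VertexId](2L, 5L, 7L))
val testNodes: RDD[Long] = sc.parallelize(Seq(3L, 4L))

// Triangle count at one vertex; returns 0 if the vertex is not in the subgraph
def trianglesAt(id: VertexId, g: Graph[String, String]): Int =
  g.triangleCount().vertices
    .filter { case (vid, _) => vid == id }
    .map { case (_, count) => count }
    .collect()
    .headOption
    .getOrElse(0)

// The loop runs on the driver, so every graph operation is launched from the driver
val triangles: Array[(Long, Int)] = testNodes.collect().map { newNode =>
  val sub = graph.subgraph(vpred = (vid, _) =>
    baseNodesBc.value.contains(vid) || vid == newNode)
  (newNode, trianglesAt(7L, sub))
}
triangles.foreach(println)
```

For the toy graph in the question this should yield (3,2) and (4,1): adding node 3 creates a second triangle (3, 5, 7) at vertex 7, while node 4 adds none. The trade-off is that each test node triggers its own Spark job, so this approach fits a modest number of test nodes rather than a very large RDD.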
