
scala.MatchError on a tuple

After processing some input data, I have an RDD[(String, String, Long)], say input, in hand:

input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54

The String fields here represent the vertices of the graph, and the Long field is the weight of the edge.

To create a graph out of this, I first insert each vertex into a map with a unique id if the vertex is not already known. If it was already encountered, I reuse the vertex id that was assigned previously. Essentially, each vertex is assigned a unique id of type Long, and then I want to create the Edges.

Here is what I am doing:

var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0          // global vertex id counter
var srcVid : Long = 0       // source vertex id
var dstVid : Long = 0       // destination vertex id

val graphEdges = input.map {
    case Row(src: String, dst: String, weight: Long) => {
        if (vertexMap.contains(src)) {
            srcVid = vertexMap(src)
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1   // pick a new vertex id
                vertexMap += (dst -> vid)
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        } else {
            vid += 1
            vertexMap(src) = vid
            srcVid = vid
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1
                vertexMap(dst) = vid
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        }
    }
}

val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);

What I see is that graphEdges is of type RDD[org.apache.spark.graphx.Edge[Long]] and graph is of type Graph[Int,Long]:

graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a

But I get the following error while printing the graph's edge and vertex counts:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): scala.MatchError: (vertexA, vertexN, 2000) (of class scala.Tuple3)
        at $anonfun$1.apply(<console>:64)
        at $anonfun$1.apply(<console>:64)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I don't understand where the mismatch is here.

Thanks @Joe K for the helpful tip. I started using zipWithIndex and the code looks compact now; however, graph instantiation still fails. Here is the updated code:

val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
    case (src, dst, weight) =>
        Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);

So, from the original 3-tuples, I form a union of the 1st and 2nd fields (which are vertices), then assign a unique id to each after deduplicating them. I then use these ids while creating the edges. However, it fails with the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
        at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
        at $anonfun$1.apply(<console>:55)
        at $anonfun$1.apply(<console>:53)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)

Any thoughts?

This specific error comes from trying to match a tuple as a Row, which it is not.

Change:

case Row(src: String, dst: String, weight: Long) => {

to just:

case (src, dst, weight) => {
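
For illustration, a minimal standalone sketch (values made up to mirror the trace above) of why the tuple pattern works: the RDD's elements are plain scala.Tuple3 values, and a match whose cases cover none of them throws scala.MatchError.

val t: (String, String, Long) = ("vertexA", "vertexN", 2000L)

// A tuple pattern destructures a Tuple3 directly:
t match {
  case (src, dst, weight) => println(s"$src -> $dst (weight $weight)")
}

// A Row(...) pattern expects an org.apache.spark.sql.Row, which t is not,
// so no case matches at runtime and scala.MatchError is thrown.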

Also, your larger plan for generating vertex ids will not work. All of the logic inside the map will happen in parallel in different executors, which will have different copies of the mutable map.
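
This is the same shared-variable pitfall that the Spark programming guide describes in its section on closures; a minimal sketch (assuming the shell's sc):

// Each task receives its own serialized copy of the closure, so mutations to
// driver-side variables happen on executor-local copies and never reach the driver.
var counter = 0
sc.parallelize(1 to 100).foreach(x => counter += x)
println(counter)  // still 0 on the driver (local mode may behave differently)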

You should flatMap your edges to get a list of all vertices, then call .distinct.zipWithIndex to assign each vertex a single unique Long value. You would then need to re-join with the original edges, as sketched below.
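
A minimal sketch of that approach, assuming input: RDD[(String, String, Long)] as in the question (the two-step join and the intermediate tuple shapes are one possible arrangement, not the only one):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Every vertex name from either end of an edge, deduplicated,
// paired with a unique Long id.
val vertexIds: RDD[(String, Long)] =
  input.flatMap { case (src, dst, _) => Seq(src, dst) }.distinct.zipWithIndex

// Join the ids back onto the edges: first keyed by source, then by destination.
val graphEdges: RDD[Edge[Long]] = input
  .map { case (src, dst, weight) => (src, (dst, weight)) }
  .join(vertexIds)  // (src, ((dst, weight), srcId))
  .map { case (_, ((dst, weight), srcId)) => (dst, (srcId, weight)) }
  .join(vertexIds)  // (dst, ((srcId, weight), dstId))
  .map { case (_, ((srcId, weight), dstId)) => Edge(srcId, dstId, weight) }

val graph = Graph.fromEdges(graphEdges, 0)

Unlike calling lookup inside map, both joins are ordinary driver-orchestrated transformations, so no RDD is referenced from inside another RDD's closure.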
