scala.MatchError on a tuple

After processing some input data, I got an RDD[(String, String, Long)], say input, in hand.

input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54

The String fields here represent the vertices of the graph, and the Long field is the weight of the edge.

To create a graph out of this, I first insert each vertex into a map with a unique id, if the vertex is not already known. If it was already encountered, I use the vertex id that was assigned previously. Essentially, each vertex is assigned a unique id of type Long, and then I want to create the Edges.

Here is what I am doing:

var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0          // global vertex id counter
var srcVid : Long = 0       // source vertex id
var dstVid : Long = 0       // destination vertex id

val graphEdges = input.map {
    case Row(src: String, dst: String, weight: Long) =>
        if (vertexMap.contains(src)) {
            srcVid = vertexMap(src)
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1   // pick a new vertex id
                vertexMap += (dst -> vid)
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        } else {
            vid += 1
            vertexMap(src) = vid
            srcVid = vid
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1
                vertexMap(dst) = vid
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        }
    }

val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);

What I see is

graphEdges is of type RDD[org.apache.spark.graphx.Edge[Long]] and graph is of type Graph[Int,Long]

graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a

but I get the following error when printing the graph's edge and vertex counts.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): scala.MatchError: (vertexA, vertexN, 2000) (of class scala.Tuple3)
        at $anonfun$1.apply(<console>:64)
        at $anonfun$1.apply(<console>:64)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I don't understand where the mismatch is here.

Thanks @Joe K for the helpful tip. I started using zipWithIndex and the code looks compact now; however, graph instantiation still fails. Here is the updated code:

val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
    case (src, dst, weight) =>
        Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);

So, from the original 3-tuple, I form a union of the 1st and 2nd fields (which are vertices), then assign a unique id to each after deduplicating them. I then use their ids while creating the edges. However, it fails with the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed
inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
        at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
        at $anonfun$1.apply(<console>:55)
        at $anonfun$1.apply(<console>:53)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)

Any thoughts?

This specific error comes from trying to match a tuple as a Row, which it is not.

Change:

case Row(src: String, dst: String, weight: Long) => {

to just:

case (src, dst, weight) => {

Also, your larger plan for generating vertex ids will not work. All of the logic inside the map will happen in parallel in different executors, each of which will have its own copy of the mutable map.

You should flatMap your edges to get a list of all vertices, then call .distinct.zipWithIndex to assign each vertex a single unique Long value. You would then need to re-join with the original edges, as in the sketch below.
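Here is a minimal sketch of that join-based approach, assuming input is the RDD[(String, String, Long)] from the question (the intermediate names, such as vertexIds, are placeholders):

import org.apache.spark.graphx.{Edge, Graph}

// Collect every vertex name from both endpoints, deduplicate,
// and assign each one a unique Long id.
val vertexIds = input
    .flatMap { case (src, dst, _) => Seq(src, dst) }
    .distinct
    .zipWithIndex                      // RDD[(String, Long)]

// Attach the ids with joins, once per endpoint, instead of
// calling lookup (or touching driver-side state) inside map.
val graphEdges = input
    .map { case (src, dst, weight) => (src, (dst, weight)) }
    .join(vertexIds)                   // (src, ((dst, weight), srcId))
    .map { case (_, ((dst, weight), srcId)) => (dst, (srcId, weight)) }
    .join(vertexIds)                   // (dst, ((srcId, weight), dstId))
    .map { case (_, ((srcId, weight), dstId)) => Edge(srcId, dstId, weight) }

val graph = Graph.fromEdges(graphEdges, 0)

Because the ids are attached with joins, nothing in the closures references another RDD or shared mutable state, which avoids both the stale-copies problem and the SPARK-5063 error from the update.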
