
Scala/Spark: Why do I get different results when I run a Spark program locally vs. on a cluster using broadcast?

I have a DataFrame, and I want each partition to get the previous partition's value, so I use a broadcast variable. This is my code:

    import scala.collection.mutable

    val arr = Array((1, 1, 1), (7, 2, 1), (3, 3, 2), (5, 4, 2),
                    (7, 5, 3), (9, 6, 3), (7, 7, 4), (9, 8, 4))
    var rdd = sc.parallelize(arr, 4)
    val bro = sc.broadcast(new mutable.HashMap[Int, Int])

    // First pass: store each partition's last value in the broadcast map
    rdd = rdd.mapPartitionsWithIndex(
      (partIdx, iter) => {
        val iterArray = iter.toArray
        bro.value += (partIdx -> iterArray.last._1)
        iterArray.toIterator
      })

    // Second pass: busy-wait until the previous partition's value shows up
    rdd = rdd.mapPartitionsWithIndex(
      (partIdx, iter) => {
        val iterArray = iter.toArray
        var flag = true
        if (partIdx != 0) {
          while (flag) {
            if (bro.value.contains(partIdx - 1)) {
              flag = false
            }
          }
          println(bro.value(partIdx - 1))
        }
        iterArray.toIterator
      })
    rdd.collect()

In the first mapPartitionsWithIndex function I put each partition's value into the broadcast map; in the second mapPartitionsWithIndex function, I read the broadcast value. The code runs fine locally, but it does not work on the cluster: the program cannot get the previous partition's value. Why do I get different results when I run the Spark program locally vs. on a cluster using broadcast?

You get different results because your code is incorrect. Broadcast objects must not be modified:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

It only seems to work locally because you are relying on an implementation detail of local mode, where all threads run in a single JVM and therefore share the same broadcast object. This makes it similar to the mistakes described in Understanding closures.
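A minimal sketch of a working alternative (variable names `lastPerPartition` and `broLast` are my own): compute each partition's last value first, collect that map on the driver, and only then broadcast it, so the broadcast variable is complete and read-only by the time executors use it. This assumes the `sc` and `rdd` from the question are already defined:

```scala
// Pass 1: find each partition's last value and collect the pairs on the driver.
val lastPerPartition: Map[Int, Int] = rdd
  .mapPartitionsWithIndex((partIdx, iter) =>
    iter.toArray.lastOption.map(last => (partIdx, last._1)).iterator)
  .collect()
  .toMap

// Only now broadcast: the map is finished and is never mutated on executors.
val broLast = sc.broadcast(lastPerPartition)

// Pass 2: every partition can safely read the previous partition's value.
rdd.mapPartitionsWithIndex((partIdx, iter) => {
  if (partIdx != 0) println(broLast.value(partIdx - 1))
  iter
}).collect()
```

Because the broadcast happens after all per-partition values are known, no task ever has to wait for another task's side effect, which also removes the busy-wait loop.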

