
Scala value is not stored in Spark

var cnt = 0
val newRDD = oldRDD.map { list =>
    // ...some work
    cnt = cnt + 1
    println(cnt) // prints 1, 2, 3... as expected
    newList // stored in the new RDD
}

// outside of map
println(cnt) // it's 0. Why?

In the logs, the prints inside map come first, and then the print outside map. Why isn't the cnt value stored?

Spark transformations should be pure functions: they receive input and produce output, without changing state or having any side-effects. Your example violates this.

What happens here is:

  • The anonymous function passed as an argument to map is serialized and sent to the workers
  • The initial value of cnt is serialized along with it
  • cnt is 0 when deserialized on each worker
  • Each worker then increments its own local copy of cnt
  • That's it: the cnt value in the driver application stays unchanged

As an alternative, you can use Spark's Accumulators to implement this kind of "counter".
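For example, here is a minimal sketch using a long accumulator (assuming the Spark 2.x API; the sample data and the names oldRDD/cnt are placeholders standing in for the question's code):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("counter-example").setMaster("local[*]"))
val oldRDD = sc.parallelize(Seq(Seq(1), Seq(2), Seq(3))) // stand-in data

// Created on the driver; tasks may only add to it
val cnt = sc.longAccumulator("cnt")

val newRDD = oldRDD.map { list =>
    cnt.add(1) // a side-effect Spark knows how to merge back to the driver
    list       // ...some work would go here
}

newRDD.count()     // accumulator updates are applied only once an action runs
println(cnt.value) // 3

One caveat: for accumulators updated inside transformations (as opposed to actions), Spark may apply an update more than once if a task is re-executed, so treat such counts as approximate.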

In the Spark framework, when you use an external variable in a closure, it is sent automatically to the worker nodes. Each task gets a new copy of the variable, but if you update the variable in the task (which is what happens in your code), the framework does not send it back and synchronize it with the rest of the program, because doing so would be too expensive.

If you're using external variables in closures, you can think of them as read-only variables.
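Reading a captured variable inside a closure works fine, because each task only needs its own copy of the value. A tiny illustrative sketch (the threshold and data here are hypothetical):

val threshold = 10 // read-only value captured from the driver
val small = sc.parallelize(1 to 20).filter(x => x < threshold) // each task reads its own copy
println(small.count()) // 9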

If you're trying to count how many elements you mapped, you can simply use oldRDD.count() / newRDD.count() in the first place (since it seems you don't filter out elements, both should give the same result).
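That is, sticking with the question's names:

val processed = newRDD.count() // equals oldRDD.count() for a 1:1 map
println(processed)

Since count() is an action, it triggers the computation and returns the result directly to the driver, with no mutable state involved.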

I think you should first read this to understand shared variables in Spark: http://spark.apache.org/docs/latest/programming-guide.html#shared-variables
