
Spark local variable broadcast to executor

var countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")
sc.parallelize(names).flatMap(countryMap.get).collect.foreach(println)

//output
Canada
US

I'm running this Spark job in YARN mode, and I'm sure that the driver and executors are not in the same node/JVM (see the attached pic). Since countryMap is not a broadcast variable, the executor should not see it, and this code shouldn't print anything. However, it printed Canada and US.

My question is: does Spark populate local variables to executors automatically if they are serializable? If not, how does the executor see the driver's local variables?

[attached image]

Hay Edwards,

When you invoke collect, the result set is brought back to the driver, which then tries to perform the mapping. That is why you can see the mappings being generated.

Cheers,

Local variables: shared within the driver or executor process; no serialization is required. Variables captured by a task closure: the driver and each copy of the task get their own isolated copy, which requires serialization.
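The distinction can be demonstrated without a cluster: Spark ships task closures via Java serialization, and we can reproduce that round-trip in plain Scala (a minimal sketch; Spark itself is not involved, and `SerializationDemo`/`roundTrip` are illustrative names, not Spark API):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationDemo {
  // Round-trip a value through Java serialization -- the same mechanism
  // Spark uses when it ships a task closure from the driver to an executor.
  def roundTrip[T](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}

val countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
// An immutable Scala Map is Serializable, so every task can receive
// its own deserialized copy -- exactly what happens in the question.
val copy = SerializationDemo.roundTrip(countryMap)
println(copy("Amy")) // prints Canada
```

Because the map round-trips cleanly, each task in the question simply receives its own deserialized copy of countryMap.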

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Reference: https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#broadcast-variables

Basically, local variables in the driver are shipped to executors automatically as part of the serialized task closure. However, you need to create broadcast variables explicitly when you need the same data across different stages.
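For comparison, this is what the question's code would look like with an explicit broadcast variable (a sketch assuming a spark-shell session where `sc` is already defined, as in the question; it cannot run standalone without a SparkContext):

```scala
// Explicitly broadcast the map once; each executor fetches it on first
// use and caches the deserialized value across tasks and stages,
// instead of receiving a fresh copy inside every task closure.
val countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val broadcastMap = sc.broadcast(countryMap)

val names = List("Amy", "Sam", "Eric")
sc.parallelize(names).flatMap(broadcastMap.value.get).collect.foreach(println)
// Canada
// US
```

The output is the same as in the question ("Eric" has no mapping, so flatMap drops it); the difference is only in how the data reaches the executors.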

To have countryMap.get run on a cluster, Spark needs to serialize countryMap and send it to every executor, so you end up with a function that already has the data attached to it in the form of an object instance. If you made countryMap's class unserializable, you would not be able to run this code at all.

So, Spark doesn't populate local variables to executors; rather, by referencing countryMap in the closure you tell it to serialize that object and run a method of that object distributedly.
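Conversely, if the driver-side variable's class is not serializable, shipping the closure fails. A minimal sketch in plain Scala (the `UnserializableMap` class is hypothetical, standing in for countryMap's type):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical wrapper that holds the same data as countryMap
// but deliberately does NOT extend Serializable.
class UnserializableMap(entries: (String, String)*) {
  private val data = entries.toMap
  def get(key: String): Option[String] = data.get(key)
}

val m = new UnserializableMap("Amy" -> "Canada", "Sam" -> "US")
try {
  // Spark does the equivalent of this when it ships a task closure.
  new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(m)
  println("closure shipped")
} catch {
  case _: NotSerializableException =>
    println("NotSerializableException: Spark could not ship this closure")
}
```

On a real cluster this surfaces as the familiar `org.apache.spark.SparkException: Task not serializable`, wrapping the same `NotSerializableException`.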
