Usage of Broadcast Variables when using only Spark-SQL API

When using the Spark-RDD API, we can use broadcast-variables to optimize the way Spark distributes immutable state.

1) How do broadcast-variables work internally?

My assumption is: with every closure that is used to perform an operation on the dataset, all its referenced variables have to be serialized, transferred over the network and restored along with the task so that the closure can be executed.

When registering a broadcast-variable like this:

val broadcastVar = sc.broadcast("hello world")

the returned object (Broadcast[String]) doesn't keep a reference to the actual object ("hello world") but only some ID. When a broadcast-variable-handle is referenced from within a closure as described above, it is serialized just the way every other variable is - it's just that the broadcast-variable-handle itself doesn't contain the actual object.

When the closure is later executed on the target nodes, the actual object ("hello world") has already been transferred to each node. When the closure hits the point where broadcastVar.value is called, the broadcast-variable-handle internally retrieves the actual object using the ID.
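A minimal sketch of that mechanism (hypothetical rdd of strings; the comments describe the assumed behaviour, not verified internals):

val broadcastVar = sc.broadcast("hello world") // Broadcast[String]

// Only the lightweight handle is captured and serialized with this closure.
// The actual string is distributed separately by the broadcast mechanism,
// and .value resolves it on each executor via the handle's ID.
val greeted = rdd.map(x => s"$x ${broadcastVar.value}")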

Is this assumption correct?

2) Is there a way to take advantage of this mechanism in Spark-SQL?

Let's say I have a set of allowed values.

When using the RDD-API I would create a broadcast-variable for my allowedValues:

val broadcastAllowedValues = sc.broadcast(allowedValues) // Broadcast[Set[String]]

rdd.filter(row => broadcastAllowedValues.value.contains(row.getAs[String]("mycol")))

Naturally, when using the Spark-SQL-API I would use the Column.isin / Column.isInCollection method for that:

dataframe.where(col("mycol").isInCollection(allowedValues))

but it seems like I can't get the advantage of a broadcast-variable this way.
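One way to see this in the physical plan: the collection is compiled straight into the filter expression, so its values travel inside the plan rather than through the broadcast mechanism:

// The allowed values end up as literals in the plan's In(...) predicate
// and are shipped with the serialized tasks, not via a broadcast
dataframe.where(col("mycol").isInCollection(allowedValues)).explain(true)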

Also, if I changed this piece of code to the following:

val broadcastAllowedValues = sc.broadcast(allowedValues) // Broadcast[Set[String]]

dataframe.where(col("mycol").isInCollection(broadcastAllowedValues.value))

this part:

col("mycol").isInCollection(allowedValues.value)
// and more important this part:
allowedValues.value

will already be evaluated on the driver, resulting in a new Column object. So the broadcast-variable loses its advantage here. It would even have some overhead compared to the first example ...
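To make the difference explicit (a sketch restating the point above):

// Driver side: .value is dereferenced while the Column is being built,
// so the whole set is baked into the expression as literals
val eagerFilter = col("mycol").isInCollection(broadcastAllowedValues.value)

// Executor side (RDD-API): the closure captures only the handle;
// .value is resolved lazily on each executor from its broadcast cache
val lazyFilter = rdd.filter(row => broadcastAllowedValues.value.contains(row.getAs[String]("mycol")))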

Is there a way to take advantage of broadcast-variables using the Spark-SQL-API, or do I have to explicitly use the RDD-API at these points?

How do broadcast-variables work internally?

The broadcasted data is serialized and physically moved to all executors. The documentation on Broadcast Variables says:

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks." “广播变量允许程序员在每台机器上缓存一个只读变量,而不是随任务一起传送它的副本。”

Is there a way to take advantage of this mechanism in Spark-SQL?

Yes, there is a way to take advantage of it: by default, Spark applies a Broadcast Hash Join when joining a big and a small DataFrame.

The book "Learning Spark, 2nd edition" says:

"By default Spark will use a broadcast join if the smaller data set is less then 10MB. This configuration is set in spark.sql.autoBroadcastJoinThreshold ; you can decrease or increase the size depending on how much memory you have on each executor and in the driver." “默认情况下,如果较小的数据集小于 10MB,Spark 将使用广播连接。此配置在spark.sql.autoBroadcastJoinThreshold设置;您可以根据每个执行程序和在司机。”

In your case you need to list all unique allowedValues in a one-column DataFrame (a DataFrame called allowedValuesDF, with a column called allowedValues) and apply a join to filter your dataframe.

Something like this:

import org.apache.spark.sql.functions.{broadcast, col}
val result = dataframe.join(broadcast(allowedValuesDF), col("mycol") === col("allowedValues"))
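A more complete sketch, reusing the imports above (assumes allowedValues is a Set[String] and that spark.implicits._ is in scope for .toDF; the left-semi join variant filters dataframe without appending the lookup column):

// Turn the allowed values into a one-column DataFrame
val allowedValuesDF = allowedValues.toSeq.toDF("allowedValues")

// Left-semi join: keeps only matching rows of dataframe, none of allowedValuesDF's columns
val result = dataframe.join(
  broadcast(allowedValuesDF),
  col("mycol") === col("allowedValues"),
  "left_semi"
)

// The physical plan should show a BroadcastHashJoin
result.explain()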

Actually, you could leave out broadcast, as Spark will do a broadcast join by default as long as the smaller side stays below the threshold.

Edit:

In later versions of Spark you could also use join hints in the SQL syntax to tell the execution engine which strategies to use. Details are provided in the SQL Documentation and an example is provided below:

-- Join Hints for broadcast join 
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
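The same hint can also be given through the DataFrame API (a sketch reusing the DataFrames from above):

// Equivalent broadcast hint via the DataFrame API
val result = dataframe.join(
  allowedValuesDF.hint("broadcast"),
  col("mycol") === col("allowedValues")
)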
