
How to broadcast a variable in a Spark Streaming mapping function?

I know the usual routine: sc.broadcast(x).

However, Spark Streaming currently does not support broadcast variables together with checkpointing.

The official guide provides a workaround: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables. However, that solution can only be used inside foreachRDD functions.
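The core of the guide's workaround is a lazily instantiated singleton: the broadcast variable is not captured in a closure but re-created on demand in whatever JVM needs it, so nothing unserializable is checkpointed. Below is a minimal sketch of that pattern with a plain list standing in for the Broadcast object (in a real job, the body of the lazy initializer would call jsc.broadcast(...) with a context recovered from an RDD; the class and field names here are illustrative, not from the Spark API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Lazily instantiated singleton, as in the official guide's workaround.
// A plain List stands in for Broadcast<List<String>> so the pattern
// itself can be shown without a running Spark context.
final class WordBlacklist {
    static final AtomicInteger loads = new AtomicInteger(); // demo counter only
    private static volatile List<String> instance;

    static List<String> getInstance() {
        if (instance == null) {                  // first check, no lock
            synchronized (WordBlacklist.class) {
                if (instance == null) {          // second check, under lock
                    loads.incrementAndGet();
                    // Real code: instance = jsc.broadcast(loadBlacklist());
                    instance = Arrays.asList("a", "b", "c");
                }
            }
        }
        return instance;                         // same object on later calls
    }
}

public class LazySingletonDemo {
    public static void main(String[] args) {
        List<String> first = WordBlacklist.getInstance();
        List<String> second = WordBlacklist.getInstance();
        System.out.println(first == second);           // true: one instance per JVM
        System.out.println(WordBlacklist.loads.get()); // 1: initializer ran once
    }
}
```

Because the singleton is re-created lazily after a restart from a checkpoint, the driver never tries to deserialize a stale Broadcast handle.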

Now I want to use large or unserializable variables (like a KafkaProducer) that need to be broadcast this way in mapping functions (such as flatMapToPair), but since there is no visible RDD variable there, I cannot retrieve the Spark context to broadcast the lazily evaluated variable. If I use the initial context that created the DStreams, or a context retrieved from a DStream, the task becomes unserializable.

So how can I use broadcast variables in mapping functions? Or is there any workaround for using large or unserializable variables in mapping functions?

I finally found the solution. To use these features, use the transform functions rather than the map functions. Inside a transform function we handle the RDDs manually and apply map functions to them ourselves, so we have a reference to each RDD and can obtain the Spark context from it.
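The answer above can be sketched as follows. This is a hedged illustration, not Spark's documented API usage for this exact case: WordBlacklist.getInstance is a hypothetical helper that lazily creates the broadcast variable once per (re)started driver, following the official guide's singleton pattern, and the stream/variable names are made up for the example:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaDStream;

import java.util.List;

public class TransformExample {
    public static JavaDStream<String> withoutBlacklisted(JavaDStream<String> lines) {
        // transform exposes the underlying RDD of each batch, unlike map/flatMapToPair.
        return lines.transform((JavaRDD<String> rdd) -> {
            // Recover the context from the RDD itself, not from an enclosing
            // object, so the task closure stays serializable.
            JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
            // Hypothetical lazy singleton; internally it would do something
            // like jsc.broadcast(loadBlacklist()) on first use.
            Broadcast<List<String>> blacklist = WordBlacklist.getInstance(jsc);
            // Ordinary RDD map/filter functions can now read the broadcast value.
            return rdd.filter(line -> !blacklist.value().contains(line));
        });
    }
}
```

The same idea covers the KafkaProducer case: keep the unserializable object out of the closure and fetch it lazily per executor, using the context recovered inside transform. This fragment needs a Spark Streaming runtime on the classpath, so it is shown as a sketch rather than a standalone program.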
