
Broadcast large array in pyspark (~8GB)

In PySpark, I am trying to broadcast a large NumPy array of around 8 GB. It fails with the error "OverflowError: cannot serialize a string larger than 4GiB". I have 15 g of executor memory and 25 g of driver memory. I have tried both the default and the Kryo serializer; neither works, and both show the same error. Can anyone suggest how to get rid of this error, and what the most efficient way is to handle large broadcast variables?

PySpark doesn't use Java-side serialization for broadcasting, so using Kryo or any other serialization setting won't help. It is simply a limitation of the pickle protocol before version 4.
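
A quick way to see that this is a pickle limitation rather than a Spark one is a minimal sketch outside Spark (it needs enough free RAM to hold the test array and its serialized copy):

import pickle
import numpy as np

# ~5 GB of data, comfortably over the 4 GiB framing limit.
big = np.zeros(5 * 1024**3, dtype=np.uint8)

# Protocols < 4 use 32-bit length fields, so this line raises
# "cannot serialize a bytes object larger than 4 GiB":
# pickle.dumps(big, protocol=3)

# Protocol 4 (Python 3.4+) uses 64-bit framing and succeeds.
blob = pickle.dumps(big, protocol=4)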

Theoretically it should be possible to adjust the PySpark code to use a specific version of the protocol in Python 3.4+, but generally speaking I am not convinced it is worth the effort. In general, broadcasting large variables in PySpark is inefficient anyway, since the broadcast data is not shared between executor processes.

If you really need this, the simplest solution is just to split the array into multiple chunks, each smaller than 4 GB. It won't make PySpark broadcasting any more efficient, but it should solve your problem.

import numpy as np

offset = ...  # chunk size in elements, chosen so each chunk stays under 4 GB
a_huge_array = np.array(...)

# Broadcast the array in slices that each fit within the pickle limit.
a_huge_array_block_1 = sc.broadcast(a_huge_array[0:offset])
a_huge_array_block_2 = sc.broadcast(a_huge_array[offset:2*offset])
...
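
As a usage sketch (hypothetical names; it assumes offset is the integer chunk size used above and that the global indices fall within the first two blocks), a task can pick the right chunk and index into it:

blocks = [a_huge_array_block_1, a_huge_array_block_2]  # list of Broadcast handles

def lookup(i):
    # Map a global index to (chunk, local index) and read from that chunk.
    return blocks[i // offset].value[i % offset]

rdd = sc.parallelize(range(2 * offset))
result = rdd.map(lookup).collect()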

A somewhat smarter way to handle this is to distribute files using a local file system instead of broadcast variables, and access them via memory mapping. You can for example use flat files or memory-mapped SQLite.
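
Below is a minimal sketch of the flat-file variant, assuming NumPy and that the file can be shipped to each worker's local disk with sc.addFile (placing it on shared storage would also work); only the small dtype/shape metadata ends up in the task closure:

import numpy as np
from pyspark import SparkFiles

# Write the array once as a raw binary file and ship it to every executor.
a_huge_array.tofile("/tmp/a_huge_array.bin")
sc.addFile("/tmp/a_huge_array.bin")

dtype, shape = a_huge_array.dtype, a_huge_array.shape  # small metadata only

def with_mmap(partition):
    # Memory-map the local copy; pages are loaded lazily via the OS page cache
    # instead of being pickled and sent with the task.
    arr = np.memmap(SparkFiles.get("a_huge_array.bin"),
                    dtype=dtype, mode="r", shape=shape)
    for i in partition:
        yield arr[int(i)]

result = sc.parallelize(range(100)).mapPartitions(with_mmap).collect()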

This is not a problem with PySpark specifically; it is a limitation of the Spark implementation.

Spark uses a Scala array to store the broadcast elements. Since the maximum Integer in Scala is about 2*10^9, the total string size is limited to roughly 2 * 2*10^9 bytes = 4 GB (two bytes per character); you can check the Spark source code.

