
Broadcast a large array in PySpark (~8 GB)

In PySpark, I am trying to broadcast a large numpy array of around 8 GB. It fails with the error "OverflowError: cannot serialize a string larger than 4GiB". I have 15 GB of executor memory and 25 GB of driver memory. I have tried both the default and the Kryo serializer; neither worked and both show the same error. Can anyone suggest how to get rid of this error, and what the most efficient way to handle large broadcast variables is?

PySpark doesn't use Java-side serialization for broadcasting, so using Kryo or any other serialization setting won't help. It is simply a limitation of the pickle protocol before version 4.

Theoretically it should be possible to adjust the PySpark code to use a specific version of the protocol in Python 3.4+, but generally speaking I am not convinced it is worth the effort. In general, broadcasting large variables in PySpark is rather inefficient, since the broadcast data is not shared between executor processes.
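
This limit is easy to reproduce outside Spark. Here is a minimal sketch (my own illustration, not part of the original answer; it needs well over 5 GB of free RAM) showing that pickling an object larger than 4 GiB fails with the old protocol but succeeds with protocol 4:

import pickle
import numpy as np

big = np.zeros(5 * 2**30, dtype=np.uint8)   # ~5 GiB of data

# pickle.dumps(big, protocol=3)   # OverflowError: cannot serialize a bytes object larger than 4 GiB
blob = pickle.dumps(big, protocol=4)        # protocol 4 (Python 3.4+) handles objects over 4 GiB
print(len(blob) / 2**30)                    # roughly 5 GiB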

If you really need this, the simplest solution is to split the array into multiple chunks, each smaller than 4 GB. It won't make PySpark broadcasting more efficient, but it should solve your problem.

import numpy as np

offset = ...                       # chunk length: choose it so each slice pickles to well under 4 GiB
a_huge_array = np.array(...)       # the ~8 GB array

# Broadcast the array as several sub-4 GiB slices instead of a single object.
a_huge_array_block_1 = sc.broadcast(a_huge_array[0:offset])
a_huge_array_block_2 = sc.broadcast(a_huge_array[offset:2*offset])
...
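
A possible way to consume the chunks inside a job (my own sketch, not part of the original answer; blocks, lookup, and the assumption that offset is a concrete integer are hypothetical) is to keep the broadcast handles in a list and map a global index to the right chunk:

blocks = [a_huge_array_block_1, a_huge_array_block_2]  # ..., one handle per chunk

def lookup(i):
    # Integer division selects the chunk, the remainder indexes inside it.
    block = blocks[i // offset]
    return float(block.value[i % offset])

result = sc.parallelize(range(10)).map(lookup).collect()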

A slightly smarter way to handle this is to distribute the data as files on each node's local file system instead of as broadcast variables, and to access them via memory mapping. You can, for example, use flat files or memory-mapped SQLite.
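
As a rough illustration of the memory-mapped flat-file idea (again my own sketch; the path and the assumption that the file already exists on every worker node are hypothetical), numpy can open a saved array with mmap_mode='r', so each task maps the file lazily instead of loading 8 GB into memory:

import numpy as np

ARRAY_PATH = "/data/a_huge_array.npy"   # assumed to be present on every worker node

def process_partition(indices):
    # Memory-map the array; pages are read lazily and shared through the OS page cache.
    arr = np.load(ARRAY_PATH, mmap_mode="r")
    for i in indices:
        yield float(arr[i])

result = sc.parallelize(range(1000)).mapPartitions(process_partition).collect()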

This is not a problem specific to PySpark; it is a limit of the Spark implementation.

Spark uses a Scala array to store the broadcast elements. Since the maximum Integer in Scala is about 2*10^9, the total string size is capped at 2*2*10^9 bytes = 4 GB (at 2 bytes per character); you can check the Spark source code.
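
For reference, the arithmetic behind that figure (assuming, as the answer implies, 2 bytes per character) works out roughly as follows:

MAX_INT = 2**31 - 1            # Scala/Java Int.MaxValue, about 2.1 * 10^9
print(MAX_INT * 2 / 2**30)     # about 4.0 GiB upper bound for a single serialized string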
