PySpark Out of Memory with window function
I'm seeing a few scalability problems with a PySpark script I've written and was wondering if anyone would be able to shed a bit of light.
I have a very similar use case to the one presented here:
Separate multi line record with start and end delimiter
In that I have some multi-line data where there is a logical delimiter between records. E.g. the data looks like:
AA123
BB123
CCXYZ
AA321
BB321
CCZYX
...
Using the example in the previous answer, I've separated this into multiple records using a script like:
import os

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import monotonically_increasing_id, expr, sum as sql_sum

# Played around with setting the available memory at runtime
spark = SparkSession \
    .builder \
    .appName("TimetableSession") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()

files = os.path.join("data", "*_lots_of_gzipped_files.gz")
# Wrap each line in a tuple so the schema can be inferred
df = spark.sparkContext.textFile(files).map(lambda line: (line,)).toDF(["entry"])
df = df.withColumn("id", monotonically_increasing_id())

w = Window.partitionBy().orderBy("id")
df = df.withColumn("AA_indicator", expr("case when entry like 'AA%' then 1 else 0 end"))
# !!!Blowing up with OOM errors here at scale!!!
df = df.withColumn("index", sql_sum("AA_indicator").over(w))
df.show()
+--------------------+---+------------+-----+
| entry| id|AA_indicator|index|
+--------------------+---+------------+-----+
| AA123| 1| 1| 1|
| BB123| 2| 0| 1|
| CCXYZ| 3| 0| 1|
| AA321| 4| 1| 2|
| BB321| 5| 0| 2|
| CCZYX| 6| 0| 2|
+--------------------+---+------------+-----+
This seems to work OK with data of a reasonable size (e.g. 50MB), but when I scale up to > 1GB of data I'm seeing Java OOM errors. I see the same problem even when attempting to allocate > 20GB of memory to spark.driver/executor.
I believe the problem is that the data is not partitioned for the window, so everything is being collected into memory at once rather than being parallelised? But I might be way off the mark with this.
I'm running this script in a standalone Docker container using the Jupyter PySpark notebook https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook .
Any help with a better approach to indexing 'records', or with the problem in general, would be much appreciated.
Probably because you use a window without PARTITION BY:

Window.partitionBy().orderBy('id')
In that case Spark doesn't distribute the data and processes all records on a single machine sequentially.
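One way to avoid the global window altogether is to group lines into records within each partition. A minimal sketch of that grouping logic, in plain Python so it reads in isolation (in Spark this would run inside a mapPartitions call, and it assumes a record never spans a partition boundary, which is not guaranteed without extra boundary handling):

```python
def group_records(lines):
    """Group flat lines into records, each starting at an 'AA' marker line."""
    current = []
    for line in lines:
        if line.startswith("AA"):
            if current:          # emit the previous record, if any
                yield current
            current = [line]     # start a new record at the delimiter
        else:
            current.append(line)
    if current:                  # emit the trailing record
        yield current

lines = ["AA123", "BB123", "CCXYZ", "AA321", "BB321", "CCZYX"]
records = list(group_records(lines))
# → [["AA123", "BB123", "CCXYZ"], ["AA321", "BB321", "CCZYX"]]
```

Because each partition is processed independently, no global ordering or single-partition window is needed.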
Having a lot of gzipped files makes it even worse, as gzip compression cannot be split. So each file is loaded in full on a single machine, and can OOM as well.
Overall this is not a workload that benefits from Spark as written. I would suggest:
Replacing the windowed cumulative sum with lower-level code, as shown in How to compute cumulative sum using Spark.
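The lower-level trick in that linked answer is a two-pass, partition-wise cumulative sum: sum each partition, collect those per-partition totals (a small driver-side collect), turn them into prefix offsets, then compute a local cumulative sum per partition plus its offset. A sketch of the idea in plain Python (in Spark this maps onto mapPartitionsWithIndex plus a broadcast of the offsets):

```python
from itertools import accumulate

# AA_indicator values from the example, split across two partitions
partitions = [[1, 0, 0], [1, 0, 0]]

totals = [sum(p) for p in partitions]           # one number per partition
offsets = [0] + list(accumulate(totals))[:-1]   # running total before each partition

# Local cumulative sum per partition, shifted by that partition's offset
cumsum = [
    [off + c for c in accumulate(p)]
    for off, p in zip(offsets, partitions)
]
# → [[1, 1, 1], [2, 2, 2]], matching the 'index' column above
```

Only the tiny list of per-partition totals crosses the network, so no single machine ever has to hold the whole dataset.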
This also seems to be relevant: Avoid performance impact of a single partition mode in Spark window functions