
Pyspark Out of Memory window function

I'm seeing a few scalability problems with a pyspark script I've written and was wondering if anyone would be able to shed a bit of light.

I have a very similar use case to the one presented here:

Separate multi line record with start and end delimiter

In that I have some multi-line data where there is a logical delimiter between records. E.g. the data looks like:

AA123
BB123
CCXYZ
AA321
BB321
CCZYX
...

Using the example in the previous answer, I've separated this into multiple records using a script like...

import os
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import monotonically_increasing_id, expr, sum as sum_

# Played around with setting the available memory at runtime
spark = SparkSession \
    .builder \
    .appName("TimetableSession") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()

files = os.path.join("data", "*_lots_of_gzipped_files.gz")
df = spark.sparkContext.textFile(files).map(lambda line: (line,)).toDF(["entry"])
df = df.withColumn("id", monotonically_increasing_id())

w = Window.partitionBy().orderBy("id")
df = df.withColumn("AA_indicator", expr("case when entry like 'AA%' then 1 else 0 end"))
# !!! Blowing up with OOM errors here at scale !!!
df = df.withColumn("index", sum_("AA_indicator").over(w))
df.show()

+--------------------+---+------------+-----+
|               entry| id|AA_indicator|index|
+--------------------+---+------------+-----+
|               AA123|  1|           1|    1|
|               BB123|  2|           0|    1|
|               CCXYZ|  3|           0|    1|
|               AA321|  4|           1|    2|
|               BB321|  5|           0|    2|
|               CCZYX|  6|           0|    2|
+--------------------+---+------------+-----+

This seems to work OK with data of a reasonable size (e.g. 50MB), but when I scale this up to > 1GB of data I'm seeing Java OOM errors. I'm seeing the same problem even when attempting to allocate > 20GB of memory to the spark.driver/executor.

I believe the problem is that the window isn't partitioning the data, so everything is being collected into memory at once rather than being parallelised? But I might be way off the mark with this.

I'm running this script in a standalone Docker container using the jupyter pyspark notebook https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook .

Any help with a better way of indexing 'records', or with a better way to approach the problem in general, would be much appreciated.

Probably because you use a window without PARTITION BY:

Window.partitionBy().orderBy('id')

In that case Spark doesn't distribute the data and processes all records on a single machine sequentially.
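One way to keep that cumulative 'AA' count distributed (a sketch of an alternative, not part of the original answer) is a two-pass approach over the RDD partitions: count the delimiters per partition, turn those counts into per-partition offsets, then assign indexes locally. The helper names (count_aa, assign_index) are mine; it reuses spark and files from the question above and assumes the input keeps lines in file order within each partition.

def count_aa(idx, it):
    # Pass 1: how many 'AA' delimiter lines does this partition contain?
    n = 0
    for line in it:
        if line.startswith("AA"):
            n += 1
    yield (idx, n)

rdd = spark.sparkContext.textFile(files)
counts = dict(rdd.mapPartitionsWithIndex(count_aa).collect())

# Offsets: number of 'AA' lines seen in all earlier partitions.
offsets, running = {}, 0
for i in sorted(counts):
    offsets[i] = running
    running += counts[i]
b_offsets = spark.sparkContext.broadcast(offsets)

def assign_index(idx, it):
    # Pass 2: per-partition offset plus a local running count,
    # so nothing is shuffled onto a single machine.
    current = b_offsets.value[idx]
    for line in it:
        if line.startswith("AA"):
            current += 1
        yield (line, current)

indexed = rdd.mapPartitionsWithIndex(assign_index).toDF(["entry", "index"])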

Having a lot of gzipped files makes it even worse, as gzip compression cannot be split. So each file is loaded on a single machine, and can OOM as well.
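Since each gzipped file already arrives as a single partition, another option (again just a sketch, with my own helper name split_records, and assuming a record never spans two files) is to rebuild the multi-line records within each file and let Spark parallelise across files, avoiding the global window entirely:

def split_records(lines):
    # Group consecutive lines into records, starting a new record at each 'AA' line.
    record = []
    for line in lines:
        if line.startswith("AA") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record

records = (spark.sparkContext
           .textFile(files)
           .glom()                      # one list of lines per partition (i.e. per gzip file)
           .flatMap(split_records)      # each element is now one complete record
           .map(lambda rec: ("\n".join(rec),))
           .toDF(["record"]))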

Overall, this is not the kind of workload that benefits from Spark.
