PySpark: java.lang.OutOfMemoryError: Java heap space

Question

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It is running on only one machine. In my process, I want to collect a huge amount of data, as shown in the code below:

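# getTagsAndText returns (x, text, tags); index the tuple instead of using
# tuple-parameter lambdas, which Python 3 no longer accepts.
train_dataRDD = (train.map(getTagsAndText)
    .filter(lambda x: x[-1] != [])
    .flatMap(lambda rec: [(tag, (rec[0], rec[1])) for tag in rec[2]])
    .groupByKey()
    .mapValues(list))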

When I do

training_data = train_dataRDD.collectAsMap()

It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses its connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is too small. How can I set it to a bigger limit?

EDIT:

Things that I tried before running: sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

I changed the Spark options as per the documentation here (if you do ctrl-f and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did the same thing, but it does not seem to be working.

Answer 1

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
#uncomment the spark.driver.memory and change it according to your use. I changed it to below
spark.driver.memory 15g
# press : and then wq! to exit vim editor

The setting is only read when the driver JVM starts, so close your existing Spark application and rerun it. You should not encounter this error again. :)
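The manual edit above could also be scripted. The following is a minimal sketch, not part of the original answer: set_spark_default and read_spark_default are hypothetical helpers that rewrite or read one property line in a spark-defaults.conf-style file. The sketch assumes whitespace-separated "key value" lines, and the demo writes to a temporary file rather than the real $SPARK_HOME/conf/spark-defaults.conf.

```python
import os
import tempfile


def set_spark_default(conf_path, key, value):
    """Set `key value` in a spark-defaults.conf-style file (hypothetical helper)."""
    lines = []
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            lines = f.read().splitlines()

    new_line = "{} {}".format(key, value)
    replaced = False
    for i, line in enumerate(lines):
        parts = line.split()
        # Only match active (uncommented) lines whose first token is the key.
        if parts and parts[0] == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)

    with open(conf_path, "w") as f:
        f.write("\n".join(lines) + "\n")


def read_spark_default(conf_path, key):
    """Return the value set for `key`, or None if it is absent or commented out."""
    with open(conf_path) as f:
        for line in f:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == key:
                return parts[1].strip()
    return None


# Demo on a throwaway file; point conf_path at the real file once this looks right.
demo_dir = tempfile.mkdtemp()
demo_conf = os.path.join(demo_dir, "spark-defaults.conf")
with open(demo_conf, "w") as f:
    f.write("# spark.driver.memory 5g\nspark.master local[*]\n")

set_spark_default(demo_conf, "spark.driver.memory", "15g")
print(read_spark_default(demo_conf, "spark.driver.memory"))
```

The commented-out template line is left untouched and a new active line is appended, which has the same effect as uncommenting and editing it by hand.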

Answer 2

If you're looking for a way to set this from within a script or a Jupyter notebook, you can do:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
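One caveat, offered as general Spark behavior rather than something stated in this answer: the driver heap size is fixed when the JVM starts, so the builder setting only helps if no SparkContext is already running in the session. That would also explain why changing sc._conf in the question had no effect. A sketch of an alternative is to set the driver memory through the PYSPARK_SUBMIT_ARGS environment variable before PySpark launches the JVM. build_submit_args is a hypothetical helper, and 15g is simply the value used above.

```python
import os


def build_submit_args(driver_memory):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper).

    The trailing "pyspark-shell" token is what PySpark expects at the end
    of this variable when it launches the JVM itself.
    """
    return "--driver-memory {} pyspark-shell".format(driver_memory)


# Must run before the first SparkSession/SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args("15g")
print(os.environ["PYSPARK_SUBMIT_ARGS"])

# Afterwards, create the session as usual, for example:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[*]").getOrCreate()
#   print(spark.sparkContext.getConf().get("spark.driver.memory"))
```

In a notebook this means restarting the kernel first, since an already running JVM keeps its original heap size.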

Answer 3

I had the same problem with pyspark (installed with brew). In my case it was installed on the path /usr/local/Cellar/apache-spark.

The only configuration file I had was in apache-spark/2.4.0/libexec/python//test_coverage/conf/spark-defaults.conf.

As suggested here, I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.
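To check where a given installation would look for this file, here is a small sketch. Spark reads spark-defaults.conf from the directory named by SPARK_CONF_DIR if that variable is set, and otherwise from $SPARK_HOME/conf. default_conf_path is a hypothetical helper, and treating the Homebrew libexec directory from this answer as SPARK_HOME is an assumption made for the demo.

```python
import os


def default_conf_path(env=None):
    """Return the expected spark-defaults.conf path, or None (hypothetical helper)."""
    env = os.environ if env is None else env
    conf_dir = env.get("SPARK_CONF_DIR")
    if not conf_dir and env.get("SPARK_HOME"):
        conf_dir = os.path.join(env["SPARK_HOME"], "conf")
    if not conf_dir:
        return None
    return os.path.join(conf_dir, "spark-defaults.conf")


# Example with the Homebrew layout from this answer (SPARK_HOME is assumed here).
brew_env = {"SPARK_HOME": "/usr/local/Cellar/apache-spark/2.4.0/libexec"}
path = default_conf_path(brew_env)
print(path)
print("exists:", os.path.exists(path))
```

Calling default_conf_path() with no argument inspects the real environment; a None result means neither variable is set, in which case the file location has to be found from the installation itself.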

Answer 1
72, accepted, 2015-09-03 15:42:03

Answer 2
25, 2020-02-17 17:44:07

Answer 3
2, 2019-01-09 14:59:16