
Out of memory error when converting pandas dataframe to pyspark dataframe

I have a pandas dataframe consisting of 180M rows and 4 columns (all integers). I saved it as a pickle file, and the file is 5.8GB. I'm trying to convert the pandas dataframe to a PySpark dataframe using spark_X = spark.createDataFrame(X), but I keep getting an "out of memory" error.
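A minimal sketch of the conversion (the column names and the smaller stand-in size are illustrative; the real frame has 180M rows and is loaded from the pickle file):

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Stand-in for the real frame: four integer columns (names are hypothetical).
X = pd.DataFrame(
    np.random.randint(0, 1_000, size=(1_000_000, 4)),
    columns=["a", "b", "c", "d"],
)

# The conversion serializes the pandas data through the driver JVM,
# which is where the Java heap space error is raised.
spark_X = spark.createDataFrame(X)
```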

The error snippet is

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space

I have over 200GB of memory, and I don't think a lack of physical memory is the issue. I read that there are multiple memory limits, e.g. the driver memory. Could this be the cause?
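The driver memory currently in effect can be read from the active session (assuming the session is bound to the name spark; Spark's default is only 1g, which is small next to a 5.8GB frame):

```python
# Read the driver memory setting from the running session's configuration;
# if the key was never set, Spark falls back to its 1g default.
print(spark.sparkContext.getConf().get("spark.driver.memory", "1g"))
```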

How can I resolve or work around this?

As suggested by @bzu, the answer here solved my problem.

I did have to manually create the $SPARK_HOME/conf folder and spark-defaults.conf file, though, as they did not exist. Also, I changed the setting to

spark.driver.memory 32g
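If the session is started from a plain Python process rather than through spark-submit, the same value can usually be passed when building the session instead. A sketch mirroring the spark-defaults.conf setting above (it only takes effect if the driver JVM has not been launched yet):

```python
from pyspark.sql import SparkSession

# Request 32g of driver heap before the JVM is created; this is ignored if a
# driver is already running (e.g. when the script is launched via spark-submit).
spark = (
    SparkSession.builder
    .appName("pandas-to-spark")
    .config("spark.driver.memory", "32g")
    .getOrCreate()
)
```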
