
Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks-connect

I'm trying to transform a PySpark dataframe of size [2734984 rows x 11 columns] into a pandas dataframe by calling toPandas(). While this works perfectly fine (11 seconds) in an Azure Databricks Notebook, I run into a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code using databricks-connect (the db-connect version and the Databricks Runtime version match and are both 7.1).

I already increased the Spark driver memory (100g) and maxResultSize (15g). I suspect the error lies somewhere in databricks-connect, because I cannot replicate it using the notebooks.
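For context, a minimal sketch of what the conversion looks like; the session setup and the table name some_table are placeholders, not taken from the original code:

    from pyspark.sql import SparkSession

    # With databricks-connect this session talks to the remote cluster;
    # in a Databricks notebook it is the cluster's own session.
    spark = SparkSession.builder.getOrCreate()

    # Placeholder source table; the real dataframe has 2734984 rows x 11 columns.
    df = spark.table("some_table")

    # toPandas() collects every row to the driver (on DBR 7.1 via Arrow, hence the
    # collectAsArrowToPython frames in the stack trace below) and materialises
    # them as a single pandas DataFrame in driver memory.
    pdf = df.toPandas()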

Any hint as to what's going on here?

The error is the following:

Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
    at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
    at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
    at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
    at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
    at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
    at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
    at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
    at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
    at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
    at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
    at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
    at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)

This is likely because databricks-connect executes the toPandas() call on the client machine, which can then run out of memory. You can increase the local driver memory by setting spark.driver.memory in the (local) config file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained with databricks-connect get-spark-home.
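A minimal sketch of that change; the 8g value is only an illustration, pick whatever fits the client machine:

    # ${spark_home}/conf/spark-defaults.conf on the client machine, where
    # ${spark_home} is the path printed by `databricks-connect get-spark-home`
    spark.driver.memory 8g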
