
rsparkling as_h2o_frame does not work: java.lang.OutOfMemoryError: GC overhead limit exceeded

I first import a dataset from a CSV file into Spark, do some transformations there, and then try to convert it into an H2O frame. Here's my code:

library(rsparkling)
library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# Read the CSV into Spark and cache it in memory
data <- spark_read_csv(sc, "some_data", paste(path, file_name, sep = ""),
                       memory = TRUE, infer_schema = TRUE)

# Convert the Spark DataFrame to an H2O frame -- this is the step that fails
data_h2o <- as_h2o_frame(sc, data)

The CSV file is about 750 MB. The last line takes a very long time to run and then fails with the following message:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
3 in stage 10.0 failed 1 times, most recent failure: Lost task 3.0 in stage 10.0
(TID 44, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

I have 16 GB of memory, and the dataset can be read into H2O directly with no issues.
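For comparison, here is a minimal sketch of that direct import path (assuming the same CSV path as above; h2o.init() and h2o.importFile() are standard h2o R calls, and the file is parsed by H2O's own reader without going through Spark):

library(h2o)

# Start (or connect to) a local H2O cluster
h2o.init()

# Parse the CSV directly with H2O's reader, bypassing Spark entirely
data_direct <- h2o.importFile(paste(path, file_name, sep = ""))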

Here is part of the log file:

18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_2 in memory! (computed 32.7 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO CodeGenerator: Code generated in 92.700007 ms
18/11/06 09:46:45 INFO MemoryStore: Will not store rdd_16_0
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_2 locally
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_3 in memory! (computed 32.8 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_3 locally
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_0 in memory! (computed 32.6 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_0 locally
18/11/06 09:46:45 INFO CodeGenerator: Code generated in 21.519354 ms
18/11/06 09:46:45 INFO MemoryStore: Will not store rdd_16_5
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_5 in memory! (computed 63.6 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.

I just found out how to solve this issue: simply increase the memory allocated to Spark. The 366.3 MB storage limit in the log shows the driver JVM was started with a small default heap, far too little for this conversion. Building a config with more driver memory and reconnecting fixes it:

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4              # use 4 local cores
conf$`sparklyr.shell.driver-memory` <- "16G"  # driver JVM heap
conf$spark.memory.fraction <- 0.9             # fraction of heap for execution and storage

# The config must be passed to spark_connect for it to take effect
sc <- spark_connect(master = "local", config = conf)
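In local mode the entire Spark application, executors included, runs inside the driver process, so sparklyr.shell.driver-memory is the setting that matters most here; spark.memory.fraction raises the share of the heap available for execution and caching above Spark's 0.6 default. Note that these settings only apply when the config is passed to spark_connect(), so an existing connection has to be torn down and re-created.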
