
Getting OutOfMemoryError GC overhead limit exceeded when collecting a dataset in Java Spark

I have some data that is approximately 250 MB in size.

I want to load the data and convert it to a map:

import java.io.Serializable;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

class MyData implements Serializable {

    private Map<String, List<SomeObject>> myMap;

    MyData(SparkSession sparkSession, String inputPath) {

        Dataset<Klass> ds = sparkSession.read().json(inputPath).as(Encoders.bean(Klass.class));
        // collectAsList() pulls every row of the Dataset into driver memory
        myMap = ds.collectAsList().stream().collect(Collectors.toMap(
                Klass::getField1,
                Klass::getField2));
    }
}

This is my Spark execution configuration:

--master yarn --deploy-mode cluster --executor-cores 2 --num-executors 200 --executor-memory 10240M

Is it not good practice to convert a Dataset to a list/map? Or is it a configuration issue? Or a code issue?

It looks like you're collecting all the data in the Dataset into the Spark driver with:

myMap = ds.collectAsList()...

Therefore you should set the driver memory with --driver-memory 2G on the command line (a.k.a. your "spark execution configuration").
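
For example, the full submit command might then look like this (a sketch: 2G is a starting point rather than a tuned value, and the remaining flags are unchanged from the question):

--master yarn --deploy-mode cluster --driver-memory 2G --executor-cores 2 --num-executors 200 --executor-memory 10240M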

The default value for this parameter is 1G, which is likely not enough for 250 MB of raw data: once the JSON is deserialized into Java bean objects on the driver, the in-memory footprint can easily be several times the raw size.
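
If the data must be collected at all, one way to lower the peak driver footprint is Dataset.toLocalIterator(), which streams one partition at a time to the driver instead of materializing the whole List first. A minimal sketch, assuming getField1() returns the String key and getField2() the List<SomeObject> value, matching the field types in the question:

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Inside the MyData constructor, replacing the collectAsList() call:
// rows arrive one partition at a time, so the full List<Klass> and the
// finished Map never coexist in driver memory.
Map<String, List<SomeObject>> map = new HashMap<>();
Iterator<Klass> it = ds.toLocalIterator();
while (it.hasNext()) {
    Klass k = it.next();
    map.put(k.getField1(), k.getField2());
}
myMap = map;

Note that the resulting Map still has to fit on the driver, so this reduces the transient overhead of collecting but does not remove the need for adequate --driver-memory.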
