Getting OutOfMemoryError GC overhead limit exceeded when collecting a dataset in Java Spark
I have some data of approximately 250MB in size.
I want to load the data and convert it to a map:
import java.io.Serializable;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

class MyData implements Serializable {
    private Map<String, List<SomeObject>> myMap;

    MyData(SparkSession sparkSession, String inputPath) {
        // Read the JSON input as a Dataset of Klass beans
        Dataset<Klass> ds = sparkSession.read().json(inputPath).as(Encoders.bean(Klass.class));
        // Collect everything to the driver and build the map keyed by field1
        myMap = ds.collectAsList().stream().collect(Collectors.toMap(
                Klass::getField1,
                Klass::getField2
        ));
    }
}
This is my Spark execution configuration:
--master yarn --deploy-mode cluster --executor-cores 2 --num-executors 200 --executor-memory 10240M
Is it not good practice to convert a dataset to a list/map? Or is it a configuration issue? Or a code issue?
It looks like you're collecting all the data in the Dataset into the Spark driver with:
myMap = ds.collectAsList()...
Therefore you should set the driver memory with --driver-memory 2G on the command line (a.k.a. your "spark execution configuration"). The default value for this parameter is 1G, which is likely not quite enough to hold 250MB of raw data once it is collected and deserialized on the driver.
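For example, the original submit options could be extended with an explicit driver-memory setting. The 2G value below is just a starting point; the right figure depends on how large the collected data actually is in the driver's heap:

--master yarn --deploy-mode cluster --driver-memory 2G --executor-cores 2 --num-executors 200 --executor-memory 10240M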