Getting OutOfMemoryError GC overhead limit exceeded when collecting a dataset in Java Spark
I have some data of approximately 250MB in size.
I want to load the data and convert it to a map:
import java.io.Serializable;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

class MyData implements Serializable {
    private Map<String, List<SomeObject>> myMap;

    MyData(SparkSession sparkSession, String inputPath) {
        // Read the JSON input as a Dataset of Klass beans
        Dataset<Klass> ds = sparkSession.read().json(inputPath).as(Encoders.bean(Klass.class));
        // Collect everything to the driver and build the map keyed by field1
        myMap = ds.collectAsList().stream().collect(Collectors.toMap(
                Klass::getField1,
                Klass::getField2
        ));
    }
}
This is my Spark execution configuration:
--master yarn --deploy-mode cluster --executor-cores 2 --num-executors 200 --executor-memory 10240M
Is it not good practice to convert a dataset to a list/map? Or is it a configuration issue? Or a code issue?
It looks like you're collecting all the data in the Dataset into the Spark driver with:
myMap = ds.collectAsList()...
Therefore you should set the driver memory with --driver-memory 2G on the command line (a.k.a. your "spark execution configuration"). The default value for this parameter is 1G, which is likely not quite enough to hold 250MB of raw data once it is collected and deserialized on the driver.
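For example, the original submit options could be extended with an explicit driver-memory setting. The 2G value below is just a starting point; the right figure depends on how large the collected data actually is in the driver's heap:

--master yarn --deploy-mode cluster --driver-memory 2G --executor-cores 2 --num-executors 200 --executor-memory 10240M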