Java Spark - java.lang.OutOfMemoryError: GC overhead limit exceeded - Large Dataset
We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing results in java.lang.OutOfMemoryError: GC overhead limit exceeded (eventually). Here's the code:
final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
.option("url", "xxxx")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("query", sql)
.option("user", "xxxx")
.option("password", "xxxx")
.load();
final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);
System.out.println("BEFORE PARALLELIZE");
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
System.out.println("AFTER");
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);
mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claims -> {
    final List<ClaimResponse> responses = new ArrayList<>();
    while (claims.hasNext()) {
        // Call next() exactly once per element. The original code called
        // claims.next() repeatedly, which skips records and eventually
        // throws NoSuchElementException.
        final GdxClaim claim = claims.next();
        System.out.println(":D " + claim.getRBAT_ID());
        if (!currentRebateId.equals(claim.getRBAT_ID())) {
            if (redisCommands == null || claim.getRBAT_ID() == null) {
                serObjList = Collections.emptyList();
            } else {
                generateYearQuarterKeys(claim);
                final String[] stringArray = redisBillingKeys.toArray(new String[0]);
                serObjList = redisCommands.mget(stringArray);
                serObjList = serObjList.stream()
                        .filter(clientObj -> clientObj.hasValue())
                        .collect(Collectors.toList());
                deserializeClientData(serObjList);
                currentRebateId = claim.getRBAT_ID();
            }
        }
        // Assumes assignRac returns a collection of ClaimResponse,
        // matching the Iterator cast in the original code.
        responses.addAll(racAssignmentService.assignRac(claim, listClientRegArr));
    }
    return responses.iterator();
};
You can ignore most of this; the line that runs forever and never returns is:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
Because of: gdxClaimDataset.collectAsList()
We are unsure where to go from here and totally stuck. Can anyone help? We've looked everywhere for some example to help.
At a high level, collectAsList() is going to bring your entire dataset into the driver's memory, and that is exactly what you need to avoid doing.
You may want to look at the Dataset docs in general (same link as above). They explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().
Keep in mind: other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to your small subset, whatever that is, either before or during the process of collection.
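The "filter first" advice could look like this sketch. It assumes a filterable column and a driver-side variable; `targetRebateId` and the column name `"RBAT_ID"` are illustrative assumptions based on the getter used in the question, not names confirmed by the original post:

```java
// Hedged sketch: narrow the Dataset on the executors before any action
// that pulls rows to the driver. The filter runs distributed, so only
// the (hopefully small) matching subset ever reaches the driver.
final Dataset<GdxClaim> smallSubset = gdxClaimDataset
        .filter(functions.col("RBAT_ID").equalTo(targetRebateId));

// Only now, if a driver-side list is truly needed and known to be small:
final List<GdxClaim> fewRows = smallSubset.collectAsList();
```

If no such filter exists for your use case, skip collecting entirely and keep the processing distributed, as shown below.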
Try to replace this line:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
with:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
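With that change, the downstream mapPartitions call can stay as it is. A minimal sketch of the collect-free pipeline, reusing gdxClaimDataset and mapFunc as defined in the question:

```java
// javaRDD() exposes the Dataset's partitions as a JavaRDD without
// materializing all 5M+ rows on the driver, so there is nothing for
// the driver's GC to choke on.
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();

// mapPartitions then runs mapFunc on the executors, one partition at a
// time, keeping memory usage bounded per task.
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);
```

Note that the parallelize(...) call was doing the opposite of what you want: it first collected everything to the driver, then re-distributed it. javaRDD() never leaves the cluster.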