
Java Spark - java.lang.OutOfMemoryError: GC overhead limit exceeded - Large Dataset

We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing results in java.lang.OutOfMemoryError: GC overhead limit exceeded (eventually). Here's the code:

final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
            .option("url", "xxxx")
            .option("driver", "com.ibm.db2.jcc.DB2Driver")
            .option("query", sql)
            .option("user", "xxxx")
            .option("password", "xxxx")
            .load();
    final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
    final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);
    System.out.println("BEFORE PARALLELIZE");
    final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
    System.out.println("AFTER");
    final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);
    mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claims -> {
        // Read the claim off the partition iterator once; calling next() repeatedly
        // would advance the iterator and skip records.
        final GdxClaim claim = claims.next();
        System.out.println(":D " + claim.getRBAT_ID());
        if (!currentRebateId.equals(claim.getRBAT_ID())) {
            if (redisCommands == null || claim.getRBAT_ID() == null) {
                serObjList = Collections.emptyList();
            } else {
                generateYearQuarterKeys(claim);

                final String[] stringArray = redisBillingKeys.toArray(new String[0]);
                serObjList = redisCommands.mget(stringArray);

                serObjList = serObjList.stream().filter(clientObj -> clientObj.hasValue()).collect(Collectors.toList());
                deserializeClientData(serObjList);
                currentRebateId = claim.getRBAT_ID();
            }
        }
        return (Iterator) racAssignmentService.assignRac(claim, listClientRegArr);
    };

You can ignore most of this; the line that runs forever and never returns is:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

Because of: gdxClaimDataset.collectAsList()

We are unsure where to go from here and are totally stuck. Can anyone help? We've looked everywhere for an example to help.

At a high level, collectAsList() is going to bring your entire dataset into memory on the driver, and this is what you need to avoid doing.

You may want to look at the Dataset docs in general (same link as above). They explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().

Keep in mind: other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to your small subset, whatever that is, either before or during the process of collection.
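For example, if only a small slice of those 5 million rows is ever needed on the driver, narrowing the dataset before collecting keeps the driver's memory bounded. A minimal sketch, where the "status" column and its value are hypothetical stand-ins for whatever predicate actually defines your subset:

    import static org.apache.spark.sql.functions.col;

    // Hypothetical filter: replace col("status").equalTo("OPEN") with whatever
    // predicate defines the small subset you need on the driver.
    final List<GdxClaim> smallSubset = gdxClaimDataset
            .filter(col("status").equalTo("OPEN"))
            .collectAsList(); // only the filtered rows reach the driver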

Try to replace this line:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

with:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
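With javaRDD() the rows stay partitioned across the executors, so the existing mapPartitions step can run without the full result ever materializing on the driver. A minimal sketch of the resulting flow, reusing the identifiers from the question and finishing with a distributed action (the output path is a placeholder):

    // No collectAsList()/parallelize() round trip through the driver:
    final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
    final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

    // Trigger execution with an action that also runs distributed, instead of
    // collecting the results back to the driver:
    gdxClaimResponse.saveAsTextFile("hdfs:///placeholder/output/path");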
