
Java Spark - java.lang.OutOfMemoryError: GC overhead limit exceeded - Large Dataset

We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing eventually results in java.lang.OutOfMemoryError: GC overhead limit exceeded. Here's the code:

final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
        .option("url", "xxxx")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("query", sql)
        .option("user", "xxxx")
        .option("password", "xxxx")
        .load();

final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);

System.out.println("BEFORE PARALLELIZE");
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
System.out.println("AFTER");

final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claim -> {
    System.out.println(":D " + claim.next().getRBAT_ID());
    if (claim != null && !currentRebateId.equals(claim.next().getRBAT_ID())) {
        if (redisCommands == null || claim.next().getRBAT_ID() == null) {
            serObjList = Collections.emptyList();
        } else {
            generateYearQuarterKeys(claim.next());

            redisBillingKeys = redisBillingKeys.stream().collect(Collectors.toList());
            final String[] stringArray = redisBillingKeys.toArray(new String[redisBillingKeys.size()]);
            serObjList = redisCommands.mget(stringArray);

            serObjList = serObjList.stream().filter(clientObj -> clientObj.hasValue()).collect(Collectors.toList());
            deserializeClientData(serObjList);
            currentRebateId = claim.next().getRBAT_ID();
        }
    }
    return (Iterator) racAssignmentService.assignRac(claim.next(), listClientRegArr);
};

You can ignore most of this; the line that runs forever and never returns is:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

This is because of the call to gdxClaimDataset.collectAsList().

We are unsure where to go from here and are totally stuck. Can anyone help? We've looked everywhere for an example.

At a high level, collectAsList() brings your entire dataset into the driver's memory, and that is exactly what you need to avoid.

You may also want to look at the Dataset docs in general; they explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().

Keep in mind that other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to the small subset you actually need, whatever that is, either before or during collection.
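For example, if downstream processing really only needs a subset of the rows, you can push the predicate into the Dataset so the filtering runs on the executors and only the reduced result is ever collected. This is a minimal sketch reusing the question's gdxClaimDataset; the RBAT_ID column and the row cap are assumptions chosen purely for illustration:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.functions;

// Hypothetical sketch: filter runs distributed on the executors, so only the
// reduced subset can ever reach the driver. Column name and limit are assumed.
final Dataset<GdxClaim> smallSubset = gdxClaimDataset
        .filter(functions.col("RBAT_ID").isNotNull()) // distributed filter
        .limit(10_000);                               // hard cap on rows sent to the driver

final List<GdxClaim> rows = smallSubset.collectAsList(); // collects only the small subset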

Try to replace this line:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

with:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
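With that change, the rest of the pipeline can stay distributed as well: mapPartitions can be applied directly to the RDD obtained from the Dataset, so the full result set never has to land on the driver. A minimal sketch, reusing the question's variable names and assuming mapFunc is defined before this point:

import org.apache.spark.api.java.JavaRDD;

// Obtain the RDD view of the Dataset without collecting it to the driver,
// then run the partition-wise transformation on the executors.
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

// Trigger execution with an action that does not pull every row back to the
// driver, e.g. count() or a write, rather than collect().
System.out.println("Total responses: " + gdxClaimResponse.count());

Note that anything the lambda references (the Redis client, services, mutable state such as currentRebateId) then executes on the executors, so it needs to be serializable or created inside the partition function.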
