
Java Spark - java.lang.OutOfMemoryError: GC overhead limit exceeded - Large Dataset

We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing eventually results in java.lang.OutOfMemoryError: GC overhead limit exceeded. Here's the code:

final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
        .option("url", "xxxx")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("query", sql)
        .option("user", "xxxx")
        .option("password", "xxxx")
        .load();

final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);

System.out.println("BEFORE PARALLELIZE");
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
System.out.println("AFTER");

final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claim -> {
    System.out.println(":D " + claim.next().getRBAT_ID());
    if (claim != null && !currentRebateId.equals(claim.next().getRBAT_ID())) {
        if (redisCommands == null || claim.next().getRBAT_ID() == null) {
            serObjList = Collections.emptyList();
        } else {
            generateYearQuarterKeys(claim.next());

            redisBillingKeys = redisBillingKeys.stream().collect(Collectors.toList());
            final String[] stringArray = redisBillingKeys.toArray(new String[redisBillingKeys.size()]);
            serObjList = redisCommands.mget(stringArray);

            serObjList = serObjList.stream().filter(clientObj -> clientObj.hasValue()).collect(Collectors.toList());
            deserializeClientData(serObjList);
            currentRebateId = claim.next().getRBAT_ID();
        }
    }
    return (Iterator) racAssignmentService.assignRac(claim.next(), listClientRegArr);
};

You can ignore most of this; the line that runs forever and never returns is:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

This is because of the call to gdxClaimDataset.collectAsList().

We are unsure where to go from here and are totally stuck. Can anyone help? We've looked everywhere for an example.

At a high level, collectAsList() brings your entire dataset into the driver's memory, and that is exactly what you need to avoid.

You may also want to look at the Dataset docs in general; they explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().

Keep in mind that other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to the small subset you actually need, whatever that is, either before or during collection.
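For example, if downstream processing really only needs a subset of the rows, you can push the predicate into the Dataset so the filtering runs on the executors and only the reduced result is ever collected. This is a minimal sketch reusing the question's gdxClaimDataset; the RBAT_ID column and the row cap are assumptions chosen purely for illustration:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.functions;

// Hypothetical sketch: filter runs distributed on the executors, so only the
// reduced subset can ever reach the driver. Column name and limit are assumed.
final Dataset<GdxClaim> smallSubset = gdxClaimDataset
        .filter(functions.col("RBAT_ID").isNotNull()) // distributed filter
        .limit(10_000);                               // hard cap on rows sent to the driver

final List<GdxClaim> rows = smallSubset.collectAsList(); // collects only the small subset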

Try to replace this line:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());

with:

final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
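With that change, the rest of the pipeline can stay distributed as well: mapPartitions can be applied directly to the RDD obtained from the Dataset, so the full result set never has to land on the driver. A minimal sketch, reusing the question's variable names and assuming mapFunc is defined before this point:

import org.apache.spark.api.java.JavaRDD;

// Obtain the RDD view of the Dataset without collecting it to the driver,
// then run the partition-wise transformation on the executors.
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

// Trigger execution with an action that does not pull every row back to the
// driver, e.g. count() or a write, rather than collect().
System.out.println("Total responses: " + gdxClaimResponse.count());

Note that anything the lambda references (the Redis client, services, mutable state such as currentRebateId) then executes on the executors, so it needs to be serializable or created inside the partition function.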
