
Spark DataFrame java.lang.OutOfMemoryError: GC overhead limit exceeded on long loop run

I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.

Here is my code:

public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());

    boolean clearOutputBeforeSaving = false;
    if (extraArgs != null && extraArgs.length > 0) {
        if (extraArgs[0].equals("clearOutput")) {
            clearOutputBeforeSaving = true;
        } else {
            logger.warn("Unknown param " + extraArgs[0]);
        }
    }

    Date currRunDate = new Date(writeStartDate.getTime());
    while (currRunDate.getTime() < writeEndDate.getTime()) {
        try {

            SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
            JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
                    inputDir,
                    currRunDate,
                    getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
            // Normalize to 1 hour & 0.25 degrees
            JavaRDD<FirstData> distinctData1 = data1.distinct();

            // Floor all (distinct) values to 6 hour windows
            JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
                    d1.getId(),
                    TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
                    d1.getLatitude(),
                    d1.getLongitude()));

            // Convert Data1 to Dataframes
            DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
            data1DF.registerTempTable("data1");

            // Read Data2 DataFrame
            String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            String inputS3Path = basedirInput + "/dt=" + currDateString;
            DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
            data2DF.registerTempTable("data2");

            // Join data1 and data2
            DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
                    "FROM data1 as D1,data2 as D2 " +
                    "WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
                    "GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");

            // Create histogram per ID
            JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
            JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());

            logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
            logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
            logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());

            // Save to parquet
            String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            if (clearOutputBeforeSaving) {
                writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
            } else {
                write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
            }
        } finally {
            TimeUtils.progressToNextDay(currRunDate);
        }
    }
}

public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
    // Apply a schema to an RDD of JavaBeans and save it as Parquet.
    DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
    fullDataDF.write().parquet(outputS3Path);
}

public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
                             SQLContext sqlContext, S3Client s3Client) {
    String fileKey = S3Utils.getS3Key(outputS3Path);
    String bucket = S3Utils.getS3Bucket(outputS3Path);

    logger.info("Deleting existing dir: " + outputS3Path);
    s3Client.deleteAll(bucket, fileKey);

    write(outputS3Path, outputRDD, outputClass, sqlContext);
}

public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
    long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY ;
    if (endOfDay < proposedEndTime.getTime()) {
        return new Date(endOfDay);
    }
    return proposedEndTime;
}

The size of data1 is around 150,000 and data2 is around 500,000.

What my code basically does is some data manipulation: it merges the 2 data objects, does a bit more manipulation, prints some statistics and saves to Parquet.

The Spark cluster has 25GB of memory per server, and the code runs fine. Each iteration takes about 2-3 minutes.

The problem starts when I run it on a large set of dates.

After a while, I get an OutOfMemoryError:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
    at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
    at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
    at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
    at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
    at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
    at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
    at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
    at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
    at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)

Last time it ran, it crashed after 233 iterations.

The line it crashed on was this:

logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());

Can anyone please tell me what could be the reason for the eventual crashes?

This error occurs when GC takes up over 98% of the total execution time of the process. You can monitor the GC time in the Spark Web UI by going to the Stages tab at http://master:4040.

Try increasing the driver/executor memory (whichever is generating this error) using spark.{driver/executor}.memory, passed via --conf when submitting the Spark application.
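
For illustration, here is a minimal sketch of how those settings could look (the 8g value is a placeholder, not a tuned recommendation). Driver memory normally has to be passed on the spark-submit command line, e.g. --conf spark.driver.memory=8g --conf spark.executor.memory=8g, because the driver JVM is already running by the time application code executes; executor memory can also be set programmatically before the context is created:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MemoryConfigSketch {
    public static void main(String[] args) {
        // Hypothetical example: request larger executors before creating the context.
        // The value is a placeholder; size it to your cluster and workload.
        // The master URL is expected to come from spark-submit.
        SparkConf conf = new SparkConf()
                .setAppName("MergedHistogramJob")       // hypothetical app name
                .set("spark.executor.memory", "8g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the job ...
        sc.stop();
    }
}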

Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It explains very clearly why the GC overhead error occurs and which garbage collector is best for your application.
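
As a rough sketch of that article's main suggestion, the executor JVMs can be switched to the G1 collector through spark.executor.extraJavaOptions (the driver's collector, like its memory, is best set at submit time, e.g. --conf spark.driver.extraJavaOptions=-XX:+UseG1GC). The snippet below only builds and prints the configuration, with no further G1 tuning flags:

import org.apache.spark.SparkConf;

public class G1GcConfigSketch {
    public static void main(String[] args) {
        // Ask executor JVMs to use G1GC. Additional -XX flags (region size,
        // pause targets, GC logging) can be appended to the same option string.
        SparkConf conf = new SparkConf()
                .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC");
        // Print the resulting configuration so the sketch runs on its own.
        System.out.println(conf.toDebugString());
    }
}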

I'm not sure that everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.

I have been running my application for several days now, and have had no crashes yet.

