
Spark Structured Streaming throwing Java OOM immediately

I am trying to build a simple pipeline that uses Kafka as a streaming source for Spark's Structured Streaming API, performs group-by aggregations, and persists the results to HDFS.

But as soon as I submit the job, I get a Java heap space error, even though the volume of streaming data is very small.

Below is the code in PySpark:

from time import gmtime, strftime

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window

spark = SparkSession.builder.getOrCreate()

# Read the raw Kafka stream and keep only the message value as a string
allEvents = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "MyNewTopic") \
    .option("group.id", "aggStream") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(col("value").cast("string"))

# aaISchema (the StructType of the JSON payload) is defined elsewhere;
# parse the JSON value and flatten the fields needed for the aggregation
aaIDF = allEvents.filter(col("value").contains("myNewAPI")) \
    .select(from_json(col("value"), aaISchema).alias("colName")) \
    .select(col("colName.eventTime"), col("colName.appId"),
            col("colName.articleId"), col("colName.locale"),
            col("colName.impression"))

# 10-minute watermark plus a 2-minute tumbling window on the event time
windowedCountsDF = aaIDF.withWatermark("eventTime", "10 minutes") \
    .groupBy("appId", "articleId", "locale", window("eventTime", "2 minutes")) \
    .sum("impression") \
    .withColumnRenamed("sum(impression)", "views")

# Persist the aggregated view counts to HDFS as Parquet
query = windowedCountsDF \
    .writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "/CDS/events/JS/agg/" + strftime("%Y/%m/%d/%H/%M", gmtime()) + "/") \
    .option("checkpointLocation", "/CDS/checkpoint/") \
    .start()

And below is the exception:

17/11/23 14:24:45 ERROR Utils: Aborting task
java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Two possible reasons:

  1. Your watermark does not take effect. You should reference the column as colName.eventTime (see the sketch after this list).

    Since the watermark is not defined on the column that the aggregation actually uses, old aggregation state is never dropped and keeps accumulating in memory.

  2. You should pass a larger value for --driver-memory or --executor-memory when submitting the Spark job; a sample invocation also follows below.

You need to have appropriate driver and executor memory set while submitting the job. This post gives you a brief idea of how to set these configurations.
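
A minimal sketch of the first fix. The assumption here is that the watermark must be attached to a top-level, timestamp-typed column whose name exactly matches the one used by window(); the explicit alias and cast("timestamp") are illustrative additions, while the DataFrame and field names come from the question:

parsedDF = allEvents.filter(col("value").contains("myNewAPI")) \
    .select(from_json(col("value"), aaISchema).alias("colName")) \
    .select(col("colName.eventTime").cast("timestamp").alias("eventTime"),
            col("colName.appId").alias("appId"),
            col("colName.articleId").alias("articleId"),
            col("colName.locale").alias("locale"),
            col("colName.impression").alias("impression"))

# The watermark now unambiguously refers to the flattened, timestamp-typed column
windowedCountsDF = parsedDF.withWatermark("eventTime", "10 minutes") \
    .groupBy("appId", "articleId", "locale", window("eventTime", "2 minutes")) \
    .sum("impression") \
    .withColumnRenamed("sum(impression)", "views")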
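
And a sketch of the second fix. The 4g values and the script name my_streaming_job.py are placeholders rather than tuned recommendations; the flags themselves are standard spark-submit options:

spark-submit \
    --driver-memory 4g \
    --executor-memory 4g \
    my_streaming_job.py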
