
Spark Streaming exception handling strategies

I have a PySpark streaming job that streams a directory from S3 (using textFileStream). Each line of input is parsed and written out in Parquet format to HDFS.

This works great under normal circumstances. However, what kind of options do I have for recovery of lost batches of data when one of the following error conditions occurs?

  • An exception occurs in the driver inside a call to foreachRDD, where output operations occur (possibly an HdfsError, or a Spark SQL exception during output operations such as partitionBy or dataframe.write.parquet()). As far as I know, this is classified as an "action" in Spark (vs. a "transformation").
  • An exception occurs in an executor, perhaps because an exception occurred in a map() lambda while parsing a line.

The system I am building must be a system of record. All of my output semantics conform to the Spark Streaming documentation for exactly-once output semantics (if a batch/RDD has to be recomputed, output data will be overwritten, not duplicated).

How do I handle failures in my output action (inside foreachRDD)? AFAICT, exceptions that occur inside foreachRDD do not cause the streaming job to stop. In fact, I've tried to determine how to make unhandled exceptions inside foreachRDD stop the job, and have been unable to do so; the closest workaround I've found is sketched below.
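For reference, this is roughly what I mean: record the failure in a driver-side flag inside foreachRDD, poll for it from the main thread, and stop the context explicitly. This is a minimal sketch, not my actual job; write_batch and the S3 path are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="s3-to-parquet")
    ssc = StreamingContext(sc, 60)  # 60-second batches (illustrative)

    lines = ssc.textFileStream("s3n://my-bucket/input/")  # placeholder path

    failure = {"exc": None}  # driver-side flag shared with the output function

    def output(time, rdd):
        try:
            write_batch(time, rdd)  # hypothetical: parse + write parquet
        except Exception as e:
            failure["exc"] = e  # record it instead of relying on propagation

    lines.foreachRDD(output)
    ssc.start()

    # Poll from the main thread; awaitTerminationOrTimeout returns False
    # when the timeout elapses and the context is still running.
    while not ssc.awaitTerminationOrTimeout(10):
        if failure["exc"] is not None:
            ssc.stop(stopSparkContext=True, stopGraceFully=False)
            raise failure["exc"]

This at least turns a silent batch failure into a hard stop of the job, but it is a workaround, not the first-class failure handling I'm asking about.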

Say an unhandled exception occurs in the driver. If I need to make a code change to resolve the exception, my understanding is that I would need to delete the checkpoint before resuming. In this scenario, is there a way to restart the streaming job in the past, from the timestamp at which it stopped?

Generally speaking, every exception thrown inside a function passed to a mapPartitions-like operation (map, filter, flatMap) should be recoverable. There is simply no good reason for a whole action/transformation to fail on a single malformed input. The exact strategy will depend on your requirements (ignore, log, keep for further processing). You can find some ideas in What is the equivalent to scala.util.Try in pyspark?
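For example, a minimal Try-like wrapper in PySpark might look like the sketch below; parse_line and the lines DStream are placeholders for your own parsing function and input stream:

    # Tag each record as a success or a failure instead of letting the
    # exception escape the executor and fail the whole stage.
    def try_parse(line):
        try:
            return ("success", parse_line(line))  # parse_line: your parser
        except Exception:
            return ("failure", line)              # keep the raw input

    results = lines.map(try_parse).cache()

    good = results.filter(lambda t: t[0] == "success").map(lambda t: t[1])
    bad = results.filter(lambda t: t[0] == "failure").map(lambda t: t[1])

    # `good` continues into the normal output path; `bad` can be logged,
    # counted, or written to a quarantine location for later inspection.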

Handling operation-wide failure is definitely harder. Since in general it may not be recoverable, and waiting may not be an option given incoming traffic, I would optimistically retry on failure and, if that doesn't succeed, push the raw data to an external backup system (S3, for example).
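A rough sketch of that retry-then-backup pattern inside foreachRDD; write_parquet, the backup path, and the retry count are all illustrative assumptions:

    import time

    MAX_ATTEMPTS = 3
    BACKUP_PATH = "s3n://my-bucket/backup"  # placeholder backup location

    def output_with_retry(batch_time, rdd):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                write_parquet(batch_time, rdd)  # hypothetical output action
                return
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    # Last resort: dump the raw lines so the batch can be
                    # replayed later. batch_time is a datetime object in
                    # PySpark's foreachRDD callback.
                    rdd.saveAsTextFile(
                        BACKUP_PATH + "/" + batch_time.strftime("%Y%m%d-%H%M%S"))
                else:
                    time.sleep(5 * attempt)  # simple linear backoff

    lines.foreachRDD(output_with_retry)

Since your output is idempotent (recomputed batches overwrite rather than duplicate), a later replay of the backed-up raw data preserves your exactly-once output semantics.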
