
Spark Streaming exception handling strategies

I have a PySpark streaming job that streams a directory from S3 (using textFileStream). Each line of input is parsed and written out in Parquet format to HDFS.

This works great under normal circumstances. However, what kind of options do I have for recovery of lost batches of data when one of the following error conditions occurs?

  • An exception occurs in the driver inside a call to foreachRDD, where output operations occur (possibly an HdfsError, or a Spark SQL exception during output operations such as partitionBy or dataframe.write.parquet()). As far as I know, this is classified as an "action" in Spark (vs. a "transformation").
  • An exception occurs in an executor, perhaps because an exception occurred in a map() lambda while parsing a line.

The system I am building must be a system of record. All of my output semantics conform to the Spark Streaming documentation for exactly-once output semantics (if a batch/RDD has to be recomputed, output data will be overwritten, not duplicated).

How do I handle failures in my output action (inside foreachRDD)? AFAICT, exceptions that occur inside foreachRDD do not cause the streaming job to stop. In fact, I've tried to determine how to make unhandled exceptions inside foreachRDD stop the job, and have been unable to do so; the closest workaround I've found is sketched below.
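For reference, this is roughly what I mean: record the failure in a driver-side flag inside foreachRDD, poll for it from the main thread, and stop the context explicitly. This is a minimal sketch, not my actual job; write_batch and the S3 path are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="s3-to-parquet")
    ssc = StreamingContext(sc, 60)  # 60-second batches (illustrative)

    lines = ssc.textFileStream("s3n://my-bucket/input/")  # placeholder path

    failure = {"exc": None}  # driver-side flag shared with the output function

    def output(time, rdd):
        try:
            write_batch(time, rdd)  # hypothetical: parse + write parquet
        except Exception as e:
            failure["exc"] = e  # record it instead of relying on propagation

    lines.foreachRDD(output)
    ssc.start()

    # Poll from the main thread; awaitTerminationOrTimeout returns False
    # when the timeout elapses and the context is still running.
    while not ssc.awaitTerminationOrTimeout(10):
        if failure["exc"] is not None:
            ssc.stop(stopSparkContext=True, stopGraceFully=False)
            raise failure["exc"]

This at least turns a silent batch failure into a hard stop of the job, but it is a workaround, not the first-class failure handling I'm asking about.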

Say an unhandled exception occurs in the driver. If I need to make a code change to resolve the exception, my understanding is that I would need to delete the checkpoint before resuming. In this scenario, is there a way to restart the streaming job in the past, from the timestamp at which it stopped?

Generally speaking, every exception thrown inside a function passed to a mapPartitions-like operation (map, filter, flatMap) should be recoverable. There is simply no good reason for a whole action/transformation to fail on a single malformed input. The exact strategy will depend on your requirements (ignore, log, keep for further processing). You can find some ideas in What is the equivalent to scala.util.Try in pyspark?
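For example, a minimal Try-like wrapper in PySpark might look like the sketch below; parse_line and the lines DStream are placeholders for your own parsing function and input stream:

    # Tag each record as a success or a failure instead of letting the
    # exception escape the executor and fail the whole stage.
    def try_parse(line):
        try:
            return ("success", parse_line(line))  # parse_line: your parser
        except Exception:
            return ("failure", line)              # keep the raw input

    results = lines.map(try_parse).cache()

    good = results.filter(lambda t: t[0] == "success").map(lambda t: t[1])
    bad = results.filter(lambda t: t[0] == "failure").map(lambda t: t[1])

    # `good` continues into the normal output path; `bad` can be logged,
    # counted, or written to a quarantine location for later inspection.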

Handling operation-wide failure is definitely harder. Since in general it may not be recoverable, and waiting may not be an option given incoming traffic, I would optimistically retry on failure and, if that doesn't succeed, push the raw data to an external backup system (S3, for example).
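A rough sketch of that retry-then-backup pattern inside foreachRDD; write_parquet, the backup path, and the retry count are all illustrative assumptions:

    import time

    MAX_ATTEMPTS = 3
    BACKUP_PATH = "s3n://my-bucket/backup"  # placeholder backup location

    def output_with_retry(batch_time, rdd):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                write_parquet(batch_time, rdd)  # hypothetical output action
                return
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    # Last resort: dump the raw lines so the batch can be
                    # replayed later. batch_time is a datetime object in
                    # PySpark's foreachRDD callback.
                    rdd.saveAsTextFile(
                        BACKUP_PATH + "/" + batch_time.strftime("%Y%m%d-%H%M%S"))
                else:
                    time.sleep(5 * attempt)  # simple linear backoff

    lines.foreachRDD(output_with_retry)

Since your output is idempotent (recomputed batches overwrite rather than duplicate), a later replay of the backed-up raw data preserves your exactly-once output semantics.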
