
Spark Streaming exception handling strategies

I have a PySpark streaming job that streams a directory from S3 (using textFileStream). Each line of input is parsed and written out in Parquet format on HDFS.
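For concreteness, here is a minimal sketch of the kind of job described; the bucket, output path, batch interval, and parse_line() helper are all assumptions, not the actual code:

```python
from pyspark.sql import Row, SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("s3-to-parquet").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, batchDuration=60)  # 60-second batches (assumption)

# Monitor a directory on S3 for new files; each record is one line of text.
lines = ssc.textFileStream("s3a://my-bucket/incoming/")  # hypothetical bucket

def parse_line(line):
    # Placeholder parser: split a comma-separated line into named fields.
    date, value = line.split(",", 1)
    return Row(date=date, value=value)

def output(time, rdd):
    # Output action: runs on the driver once per batch.
    if rdd.isEmpty():
        return
    df = spark.createDataFrame(rdd.map(parse_line))
    df.write.mode("append").partitionBy("date").parquet("hdfs:///data/output/")  # hypothetical path

lines.foreachRDD(output)
ssc.start()
ssc.awaitTermination()
```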

This works great under normal circumstances. However, what options do I have for recovering lost batches of data when one of the following error conditions occurs?

  • An exception occurs in the driver inside a call to foreachRDD, where the output operations happen (possibly an HdfsError, or a Spark SQL exception during an output operation such as partitionBy or dataframe.write.parquet()). As far as I know, this is classified as an "action" in Spark (vs. a "transformation").
  • An exception occurs in an executor, perhaps because an exception occurred in a map() lambda while parsing a line.

The system I am building must be a system of record. All of my output semantics conform to the Spark Streaming documentation's exactly-once output semantics (if a batch/RDD has to be recomputed, the output data is overwritten, not duplicated).
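(As an illustration, one common way to get this overwrite-not-duplicate behavior is to derive the output path deterministically from the batch time, so a recomputed batch replaces its earlier output rather than appending to it; the path layout here is an assumption, not necessarily the actual scheme.)

```python
def output_batch(time, rdd):
    if rdd.isEmpty():
        return
    df = spark.createDataFrame(rdd.map(parse_line))
    # Deterministic per-batch directory: recomputing this batch overwrites
    # exactly the same files instead of creating duplicates.
    batch_dir = "hdfs:///data/output/batch=%s" % time.strftime("%Y%m%d%H%M%S")
    df.write.mode("overwrite").parquet(batch_dir)
```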

How do I handle failures in my output action (inside foreachRDD)? AFAICT, exceptions that occur inside foreachRDD do not cause the streaming job to stop. In fact, I've tried to figure out how to make an unhandled exception inside foreachRDD stop the job, and have been unable to do so.

Say an unhandled exception occurs in the driver. If I need to make a code change to resolve the exception, my understanding is that I would need to delete the checkpoint before resuming. In that scenario, is there a way to start the streaming job in the past, from the timestamp at which it stopped?

Generally speaking, every exception thrown inside a function passed to a mapPartitions-like operation (map, filter, flatMap) should be recoverable. There is simply no good reason for a whole action/transformation to fail on a single malformed input. The exact strategy will depend on your requirements (ignore, log, keep for further processing). You can find some ideas in "What is the equivalent to scala.util.Try in pyspark?"
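A rough sketch of that strategy in PySpark; the parse_line() helper is a stand-in for whatever per-line parsing the job does:

```python
def safe_parse(line):
    # Try-like wrapper: never let one malformed line fail the whole stage.
    try:
        return ("ok", parse_line(line))       # parse_line: the real parser
    except Exception as e:
        return ("error", (line, repr(e)))     # keep the raw line and the error

def output(time, rdd):
    tagged = rdd.map(safe_parse).cache()
    good = tagged.filter(lambda t: t[0] == "ok").map(lambda t: t[1])
    bad = tagged.filter(lambda t: t[0] == "error").map(lambda t: t[1])
    # good -> write to Parquet as usual; bad -> log, count, or save for
    # later inspection/reprocessing, depending on requirements.
```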

Handling operation-wide failure is definitely harder. Since, in general, it may not be recoverable, and waiting may not be an option given the incoming traffic, I would optimistically retry in case of failure and, if that doesn't succeed, push the raw data to an external backup system (S3, for example).
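A sketch of that retry-then-back-up idea, to be called from inside foreachRDD; the backoff policy and the backup path are assumptions:

```python
import time as systime

def write_with_retry(df, path, raw_rdd, backup_path, attempts=3):
    for attempt in range(attempts):
        try:
            df.write.mode("overwrite").parquet(path)
            return
        except Exception:
            systime.sleep(2 ** attempt)  # simple exponential backoff between retries
    # All retries failed: dump the raw text somewhere durable so the batch
    # can be replayed later, e.g. an S3 "dead letter" prefix.
    raw_rdd.saveAsTextFile(backup_path)
```

Called from the output function, this keeps the streaming job alive while still preserving the raw batch for replay when the write ultimately fails.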
