
spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like

Could not compute split, block input-0-1464774108087 not found

and I was wondering if there is a way to reprocess those batches on the side, without messing with the currently running application. In general, it does not have to be this exact exception.

Thanks in advance, Pradeep

This may happen when your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that when memory runs low, Spark can spill data to disk. This will prevent this error.
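As a concrete illustration, here is a minimal Scala sketch of passing an explicit StorageLevel to a receiver input stream. It assumes a simple socket text source on localhost:9999 purely for illustration (your actual source and host/port will differ); the relevant part is the StorageLevel.MEMORY_AND_DISK_SER argument, which lets received input blocks be serialized to disk when memory runs low.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpillToDiskExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spill-to-disk-example")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Pass the storage level explicitly so that input blocks which do not fit
    // in memory are spilled to disk instead of being dropped.
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}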

Also, I don't think this error means that any data was lost while processing; rather, the input block that was added by your block manager timed out before processing started.

Check a similar question on the Spark User list.

Edit:

Data is not lost; it was just not present where the task was expecting it to be. As per the Spark docs:

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
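As a rough sketch of the persist()/cache() behaviour described in that quote (a standalone batch example with made-up data, not tied to the streaming job above):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object PersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-example").setMaster("local[*]"))

    val data = sc.parallelize(1 to 1000000).map(_ * 2)

    // Mark the RDD for persistence; with MEMORY_AND_DISK_SER, partitions that
    // do not fit in memory are serialized to disk rather than dropped.
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action computes and caches the partitions.
    println(data.count())
    // Later actions reuse the cached partitions; any lost partition is
    // recomputed from its lineage (parallelize + map) automatically.
    println(data.sum())

    sc.stop()
  }
}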
