Spark streaming from eventhub: how to stop stream once there is no more data?

What I am trying to do is read some data from my Event Hub and save it in Azure Data Lake. However, the issue is that the stream never stops, and the writeStream step is not triggered. I am not able to find any setting that identifies when the input rate reaches 0, so that I could stop the stream at that point.
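A stripped-down version of the kind of pipeline I mean is sketched below; it assumes the azure-event-hubs-spark connector, and the connection string and paths are placeholders:

    # Hypothetical sketch, assuming the azure-event-hubs-spark connector;
    # connection_string and all paths are placeholders, not real values.
    # (On newer connector versions the connection string may need to be
    # encrypted with EventHubsUtils.encrypt first.)
    ehConf = {"eventhubs.connectionString": connection_string}

    raw_data = (spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load())

    # By default this runs forever - nothing tells it to stop
    # once the Event Hub has been drained.
    (raw_data.writeStream
        .format("parquet")
        .option("checkpointLocation", "/mnt/datalake/checkpoints/eventhub")
        .start("/mnt/datalake/output/eventhub"))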


There is a special trigger in Apache Spark, often called Trigger.Once - it will process all available data and then shut down the stream. Just add .trigger(once=True) after .writeStream to enable it.
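A minimal sketch of what that looks like, assuming raw_data is the streaming DataFrame read from Event Hubs (the sink format and paths are placeholders):

    # Minimal sketch of Trigger.Once; format and paths are placeholders.
    query = (raw_data.writeStream
        .format("parquet")
        .option("checkpointLocation", "/mnt/datalake/checkpoints/eventhub")
        .trigger(once=True)        # process everything available, then stop
        .start("/mnt/datalake/output/eventhub"))

    query.awaitTermination()       # returns once the single batch completes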

The only problem with it is that in Spark 3.x (DBR >= 7.x) it completely ignores options like maxFilesPerTrigger, etc. that limit the amount of data pulled for processing - in this case it will try to process all of the data in one go, and sometimes that may lead to performance problems. To work around that, you can use the following hack: assign the result of raw_data.writeStream.....start() to a variable, like query = raw_data.writeStream...., and periodically check the value of query.lastProgress['numInputRows'] (lastProgress describes the most recent micro-batch); if it stays at 0 for some period of time, issue query.stop().
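A hedged sketch of that polling loop, assuming raw_data is the streaming DataFrame from the question; the sink format, paths, idle threshold, and sleep interval are illustrative assumptions, not Spark defaults:

    import time

    # Sketch of the polling hack; format and paths are placeholders.
    query = (raw_data.writeStream
        .format("parquet")
        .option("checkpointLocation", "/mnt/datalake/checkpoints/eventhub")
        .start("/mnt/datalake/output/eventhub"))

    idle_checks = 0
    while query.isActive:
        progress = query.lastProgress      # dict for the latest micro-batch, or None
        if progress and progress.get("numInputRows", 0) == 0:
            idle_checks += 1               # no new rows seen in this check
        else:
            idle_checks = 0                # data arrived, reset the counter
        if idle_checks >= 3:               # "idle" for 3 consecutive checks (arbitrary)
            query.stop()                   # shut the stream down gracefully
            break
        time.sleep(30)                     # polling interval in seconds (arbitrary)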
