
Spark Structured Streaming: Does processing load affect Input Rate/numInputRecords?

My current Structured Streaming application writes to a huge Delta table. When I stop the stream and point it at a brand-new Delta table:

  1. It becomes much faster: batch duration drops to roughly a quarter of what it was.
  2. The input rate increases almost 3x.

I understand why it might become faster: the aggregations/writes it was doing against the older/bigger table are not needed on the new table. But the input rate change is something I am hoping someone can explain.

The source is Azure EventHubs.
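For reference, a minimal sketch of the setup described above (not the asker's actual code), assuming the com.microsoft.azure:azure-eventhubs-spark connector is on the classpath; the connection string and paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("eventhubs-to-delta").getOrCreate()

    # The connector expects the connection string encrypted via EventHubsUtils
    # (documented pattern for PySpark); "<namespace>" etc. are placeholders.
    jvm = spark.sparkContext._jvm
    eh_conf = {
        "eventhubs.connectionString": jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(
            "Endpoint=sb://<namespace>.servicebus.windows.net/;"
            "SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<hub>"
        ),
    }

    stream = (
        spark.readStream
             .format("eventhubs")
             .options(**eh_conf)
             .load()
    )

    query = (
        stream.writeStream
              .format("delta")
              .option("checkpointLocation", "/mnt/checkpoints/my-stream")  # placeholder path
              .start("/mnt/delta/target-table")  # repointing this at a fresh path is the experiment above
    )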

Thanks!

Answering my own question:

The logic behind Input Rate and Processing Rate appears to be the following:

Input Rate      = numInputRows (batch size) / trigger interval in seconds
Processing Rate = numInputRows (batch size) / batch duration in seconds

Without an explicit trigger interval, the two should be the same, because batch duration = trigger interval.
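These numbers can be read off a running query via StreamingQuery.lastProgress (a dict in PySpark, parsed from the StreamingQueryProgress JSON). A minimal sketch, assuming `query` is the handle returned by `writeStream.start()`:

    from pyspark.sql.streaming import StreamingQuery

    def print_rates(query: StreamingQuery) -> None:
        """Print the metrics behind Input Rate / Processing Rate for the last batch."""
        p = query.lastProgress
        if p is None:
            return  # no batch has completed yet
        batch_duration_s = p["durationMs"]["triggerExecution"] / 1000.0
        print("numInputRows           :", p["numInputRows"])
        print("inputRowsPerSecond     :", p["inputRowsPerSecond"])      # the UI's Input Rate
        print("processedRowsPerSecond :", p["processedRowsPerSecond"])  # the UI's Processing Rate
        # Processing Rate should come out to roughly numInputRows / batch duration:
        print("numInputRows / duration:", p["numInputRows"] / batch_duration_s)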

So with a bigger table with lots of partitions, the writes and aggregates take longer, which increases the batch duration and thereby decreases the Input Rate (and Processing Rate). The opposite holds for the smaller target table, which explains its faster input/processing rates.
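Purely illustrative arithmetic with hypothetical numbers, showing how a shorter batch duration alone raises both reported rates when no trigger interval is set:

    # Hypothetical batch: same number of rows, processed against a large
    # Delta table vs. a fresh one; with no explicit trigger interval,
    # batch duration acts as the effective trigger interval.
    num_input_rows = 120_000

    batch_durations_s = {
        "big table": 48.0,  # slow writes/aggregations on the big table
        "new table": 12.0,  # ~1/4 the duration on the fresh table
    }

    for label, duration in batch_durations_s.items():
        rate = num_input_rows / duration
        print(f"{label}: input/processing rate = {rate:,.0f} rows/sec")

    # big table: 2,500 rows/sec; new table: 10,000 rows/sec -> the shorter
    # batch duration alone accounts for the higher reported input rate.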


 