
Azure Data Factory copy to CosmosDB throttling

I have an Azure Data Factory pipeline that executes a 'Copy' step which takes a blob file with JSON data and copies it over to my CosmosDB.

The blob file is 75 MB and my CosmosDB is scaled to 10,000 RUs (autoscale). The Azure Data Factory pipeline takes about 5 minutes to copy over all the data, but the main problem is that the CosmosDB is being throttled because of the many requests. When checking the metrics page, the 'Normalized RU Consumption' spikes to 100% instantly.

(screenshot: RU consumption spike)

I have been looking for a solution where the Data Factory pipeline simply spends more time on the copy step instead of pushing the data this fast. I tried adjusting the settings in the 'Copy' step in Data Factory, but that did not change anything at all.

(screenshot: Copy step settings)

Is there another way to make sure that the Data Factory pipeline does not consume all the RUs? It is no problem if the pipeline runs for 1 hour+. The current issue is that my CosmosDB database is unavailable during the copy because the Data Factory is taking up all the RUs; other requests are then returned a 429 'Too many requests'.

Any suggestions are welcome!

EDIT: I have upscaled my CosmosDB to 50,000 RUs just to test. The Data Factory pipeline now succeeds in 2 minutes. That is a good improvement, but it still occupied 100% of the RUs, and the database was unavailable for about 5 minutes (I think CosmosDB still does some work after the Data Factory pipeline has succeeded). This is what I'd like to prevent: the 100% spikes. Ideally only 50% of the RUs would be utilised and the copy would take double the time. Would this be possible?

I do not know any way to set a simple RU limit. But...

Last time I was in this position, it did seem to help to manually limit data integration units and parallelism to small numbers. A smaller number of clients should put SOME upper limit on write throughput. It's not an exact science, though; it may depend on the input source type how streamed/parallel the data reading part is to begin with.
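As a rough sketch of what I mean, here is a trimmed copy activity definition with both knobs pinned down. dataIntegrationUnits and parallelCopies are standard copy activity settings; the activity name and the dataset/store details (omitted here) are purely illustrative:

```json
{
    "name": "CopyBlobToCosmos",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "JsonSource" },
        "sink": { "type": "CosmosDbSqlApiSink" },
        "dataIntegrationUnits": 2,
        "parallelCopies": 1
    }
}
```

Note that this does not cap RUs directly; it only limits how many concurrent readers/writers ADF runs, so the throughput ceiling is indirect.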

Another delaying measure was to set a higher retry interval for the copy activity. This way, when ADF itself got throttled, it created openings for other clients to be served, at the cost of increased duration and cost of the ADF run. Most likely OK for one-time actions.
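That lives in the activity's retry policy; a sketch with arbitrary numbers (pick values that fit your own run-time budget):

```json
{
    "name": "CopyBlobToCosmos",
    "type": "Copy",
    "policy": {
        "retry": 5,
        "retryIntervalInSeconds": 120
    }
}
```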

I also played with the sink write batch size to reduce protocol chattiness and improve overall ingestion time. I'm not sure whether it affected overall RU usage, so it is most likely an aspect to balance.
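For example, a smaller batch size on the Cosmos DB sink (the value 1000 here is just an illustration, not a recommendation):

```json
"sink": {
    "type": "CosmosDbSqlApiSink",
    "writeBatchSize": 1000
}
```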

Another trick you could use is to partition the input file into smaller chunks in ADF, push the batches sequentially, and introduce small delays between batches yourself, so that some RUs stay available for other clients between every batch.
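One way to express that is a sequential ForEach that copies one chunk and then waits before the next. This is only a sketch: the chunkList parameter, the activity names and the 60-second delay are made up for illustration, and the inner copy activity is trimmed:

```json
{
    "name": "ForEachChunk",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@pipeline().parameters.chunkList", "type": "Expression" },
        "isSequential": true,
        "activities": [
            { "name": "CopyChunkToCosmos", "type": "Copy" },
            {
                "name": "CoolDownBetweenChunks",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "CopyChunkToCosmos", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": { "waitTimeInSeconds": 60 }
            }
        ]
    }
}
```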
