
Retrieve 1+ million records from Azure Table Storage

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing.

It is expected that there are about 1 - 1.5 million records without property A. I understand there are two approaches.

  1. Query all records, then filter the results afterwards
  2. Do a table scan

Currently, it is using the approach where we query all records and filter in C#. However, the task is running in an Azure Function App. The query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.

I'm trying to understand why retrieving 1 million records is taking so long and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID - this leads me to believe that there is one entity per partition.

Looking at the Microsoft docs, here are some key Table Storage limits ( https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets ):

  • Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
  • Target throughput for a single table partition (1-KiB entities): up to 2,000 entities per second.

My initial guess is that I should use a different partition key to group 2,000 entities per partition, to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
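A quick back-of-the-envelope check of that guess, using the two documented limits above. This is a sketch that treats one entity read as one transaction (which is what the limits table assumes for 1-KiB entities); the entity counts and partition size are the figures from the question, not measurements:

```python
# Back-of-the-envelope check: does repartitioning get us to 1 second?
TOTAL_ENTITIES = 2_000_000
ENTITIES_PER_PARTITION = 2_000
PER_PARTITION_THROUGHPUT = 2_000   # entities/sec per partition target
ACCOUNT_LIMIT = 20_000             # transactions/sec per storage account

partitions = TOTAL_ENTITIES // ENTITIES_PER_PARTITION  # 1,000 partitions

# Even reading every partition in parallel at its full target rate,
# the account-wide limit caps the aggregate throughput:
aggregate = min(partitions * PER_PARTITION_THROUGHPUT, ACCOUNT_LIMIT)
seconds = TOTAL_ENTITIES / aggregate

print(partitions)  # 1000
print(seconds)     # 100.0
```

So no: the per-account limit of 20,000 transactions/sec dominates, and 2 million entities would take at least ~100 seconds even with ideal parallelism, not 1 second.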

Any thoughts or advice appreciated.

I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).

Here's my blog post: https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables

I have mentioned a couple of options in this blog post, but I think the fastest is distributing the "table scan" work into smaller work items that can easily be completed within the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function, but most of the clever part (finding the partition key ranges) is implemented and tested.
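The core idea can be sketched as follows. This is a simplified illustration, not the linked implementation: it assumes you already have a sorted sample of partition keys (the real implementation discovers these boundaries by probing the live table), and it splits the key space into contiguous ranges, each of which becomes one independent work item:

```python
# Sketch: split a full table scan into independent work items by
# partition-key range. `sample_keys` is assumed to be a sorted sample
# of partition keys; here it is hard-coded for illustration.

def split_into_ranges(sample_keys, n_items):
    """Return (lower, upper) partition-key bounds for n_items work items.

    `None` means an open bound, so the ranges cover the whole key space.
    """
    step = max(1, len(sample_keys) // n_items)
    boundaries = sample_keys[step::step][: n_items - 1]
    lowers = [None] + boundaries
    uppers = boundaries + [None]
    return list(zip(lowers, uppers))

sample = sorted(["03f", "1ab", "42c", "7de", "9aa", "c01", "e99", "fff"])
ranges = split_into_ranges(sample, 4)
# Each (lower, upper) pair becomes one queue message; a function
# invocation then scans only PartitionKey >= lower and PartitionKey < upper,
# which keeps each work item comfortably inside the 10-minute limit.
```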

This looks to be essentially what user3603467 is suggesting in his answer.

I see two approaches to retrieving 1+ million records in a batch process, where the result must be saved to a single medium - like a file.

First) You identify/select all primary ids/keys of the related data. Then you spawn parallel jobs with chunks of these primary ids/keys, where you read the actual data and process it. Each job then reports its result to the single medium.
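A minimal sketch of this first approach. The helper `fetch_and_process` is a hypothetical stand-in for reading the entities for a chunk of keys from Table Storage and processing them; here it just tags each key so the fan-out/fan-in shape is visible:

```python
# Sketch: select all keys up front, then fan the reads out over
# parallel workers in fixed-size chunks and collect the results.
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield successive chunks of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_and_process(key_chunk):
    # Hypothetical placeholder: the real job would read the entities
    # for these keys from Table Storage and process them.
    return [f"processed-{k}" for k in key_chunk]

keys = [f"guid-{i:04d}" for i in range(10)]   # the selected primary keys

with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r
               for chunk_result in pool.map(fetch_and_process, chunked(keys, 3))
               for r in chunk_result]
# `results` holds one entry per key; write them to the single medium.
```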

Second) You identify/select (for update) the top n of the related data, and mark this data with a state of being processed. Use concurrency locking here; that should prevent others from picking the same data up if this is done in parallel.

I would go for the first solution if possible, since it is the simplest and cleanest. The second solution is best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.

You'll need to parallelise the task. As you don't know the partition keys, run 26 separate queries, one per letter of the alphabet, each covering a range of PK values. Write a query where PK >= 'A' && PK < 'B', then PK >= 'B' && PK < 'C', etc. Then join the 26 results in memory. Super easy to do in a single function. In JS just use Promise.all([]).
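Generating the range filters is the only fiddly part. A sketch, using the Table service OData comparison operators (`ge`/`lt`); note the first and last buckets are left open-ended, since GUID partition keys can start with a digit and would otherwise be missed by a strict 'a'..'z' split:

```python
# Sketch: build one OData filter string per letter bucket; each filter
# is run as a separate query and the results are merged in memory.
import string

def partition_key_filters():
    letters = string.ascii_lowercase
    filters = [f"PartitionKey ge '{lo}' and PartitionKey lt '{hi}'"
               for lo, hi in zip(letters, letters[1:])]
    # Open-ended first and last buckets so keys outside 'a'..'z'
    # (e.g. GUIDs starting with a digit) are still covered.
    filters[0] = "PartitionKey lt 'b'"
    filters.append("PartitionKey ge 'z'")
    return filters

filters = partition_key_filters()
print(len(filters))  # 26
```

Each filter string can then be passed to a query call in whatever SDK you use, with all 26 queries run concurrently.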
