
Retrieve 1+ million records from Azure Table Storage

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing.

It is expected that there are about 1-1.5 million records without property A. I understand there are two approaches:

  1. Query all records then filter results after
  2. Do a table scan

Currently it uses the first approach: query all records, then filter in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the execution time limit for Azure Functions.
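For reference, the retrieval is essentially a full table scan with a client-side filter. A simplified sketch of it in TypeScript with the @azure/data-tables package (the real job is C#; the table name and `PropertyA` are placeholders for the actual names):

    import { TableClient } from "@azure/data-tables";

    const client = TableClient.fromConnectionString(
      process.env.STORAGE_CONNECTION_STRING!,
      "MyTable"
    );

    async function entitiesWithoutPropertyA() {
      const matches: Record<string, unknown>[] = [];
      // Full table scan: the service returns at most 1,000 entities per
      // page, and pages are fetched sequentially via continuation tokens.
      for await (const entity of client.listEntities()) {
        // Table Storage is schemaless and cannot filter server-side on a
        // property being absent, so the check has to happen on the client.
        if (entity.PropertyA == null) {
          matches.push(entity);
        }
      }
      return matches;
    }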

I'm trying to understand why retrieving 1 million records is taking so long and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID - this leads me to believe that there is one entity per partition.

Looking at the Microsoft docs, here are some key Table Storage limits (https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):

  • Maximum request rate per storage account: 20,000 transactions per second, assuming a 1-KiB entity size.
  • Target throughput for a single table partition (1-KiB entities): up to 2,000 entities per second.

My initial guess is that I should use a different partition key, grouping 2,000 entities per partition, to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could, in theory, be returned in 1 second?
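Working through the numbers: that would only hold if the partitions were queried in parallel, since the target is per partition. A single query returns at most 1,000 entities per response and has to follow continuation tokens serially, so 2,000,000 records means roughly 2,000 sequential round trips; at, say, 50 ms per round trip (an assumed latency, not a measured one), that is already ~100 seconds before any processing.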

Any thoughts or advice appreciated.

I found this question after blogging on this very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage table (3.5 million records).

Here's my blog post: https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables

I have mentioned a couple of options in this blog post, but I think the fastest is distributing the "table scan" work into smaller work items that can easily be completed within the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function, but most of the clever part (finding the partition key ranges) is implemented and tested.
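Roughly, the idea is to split the key space into bounded ranges up front and hand each range to a separate invocation. A sketch of that decomposition in TypeScript (my illustration here, not the implementation from the post, which discovers the partition key ranges dynamically) - since your keys are GUIDs, splitting on the first hex character gives 16 evenly loaded work items:

    // One work item per leading hex digit of the GUID partition keys
    // (assumes lowercase GUID strings). Each item could be enqueued and
    // handled by a separate function invocation.
    const HEX = "0123456789abcdef".split("");

    interface RangeWorkItem {
      fromKey: string;  // inclusive lower bound on PartitionKey
      toKey?: string;   // exclusive upper bound; undefined = open-ended
    }

    function buildWorkItems(): RangeWorkItem[] {
      return HEX.map((ch, i) => ({ fromKey: ch, toKey: HEX[i + 1] }));
    }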

This looks to be essentially what user3603467 is suggesting in his answer.

I see two approaches to retrieving 1+ million records in a batch process where the result must be saved to a single medium, like a file.

First) You identify/select the primary IDs/keys of all related data. Then you spawn parallel jobs, each given a chunk of these primary IDs/keys, which read the actual data and process it. Each job then reports its result to the single medium (see the sketch after this answer).

Second) You identify/select (for update) the top n records of related data, and mark them with a state of being processed. Use concurrency locking here; that should prevent others from picking the same data up if this is done in parallel.

I would go for the first solution if possible, since it is the simplest and cleanest. The second solution is best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
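A rough sketch of the first solution in TypeScript with the @azure/data-tables package (the table name and `PropertyA` are placeholders taken from the question): one pass projects just the keys plus the flag property, then fixed-size chunks of keys are processed in parallel.

    import { TableClient } from "@azure/data-tables";

    const client = TableClient.fromConnectionString(
      process.env.STORAGE_CONNECTION_STRING!,
      "MyTable"
    );

    // Pass 1: select only the keys (plus PropertyA, to filter client-side),
    // keeping the identifying scan as cheap as possible.
    async function collectKeys(): Promise<{ pk: string; rk: string }[]> {
      const keys: { pk: string; rk: string }[] = [];
      for await (const e of client.listEntities({
        queryOptions: { select: ["PartitionKey", "RowKey", "PropertyA"] },
      })) {
        if (e.PropertyA == null) {
          keys.push({ pk: e.partitionKey!, rk: e.rowKey! });
        }
      }
      return keys;
    }

    // Pass 2: fan the keys out in chunks. Here the chunks run in-process;
    // in a Function App each chunk could instead become a queue message
    // that triggers its own invocation.
    async function processAll(chunkSize = 1000): Promise<void> {
      const keys = await collectKeys();
      for (let i = 0; i < keys.length; i += chunkSize) {
        await Promise.all(
          keys.slice(i, i + chunkSize).map(({ pk, rk }) => processOne(pk, rk))
        );
      }
    }

    async function processOne(pk: string, rk: string): Promise<void> {
      const entity = await client.getEntity(pk, rk); // read the actual data
      // ... do the further processing and report the result ...
    }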

You'll need to parallelise the task. As you don't know the partition keys, run a separate query for each letter of the alphabet: one where PK >= 'A' && PK < 'B', another where PK >= 'B' && PK < 'C', and so on. Then join the results in memory. Super easy to do in a single function. In JS just use Promise.all([]).
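For example, in TypeScript with @azure/data-tables (one adaptation: the question says the partition keys are GUIDs, so the ranges below split on hex digits rather than letters, and assume lowercase keys; `PropertyA` is again a placeholder):

    import { TableClient, odata } from "@azure/data-tables";

    const client = TableClient.fromConnectionString(
      process.env.STORAGE_CONNECTION_STRING!,
      "MyTable"
    );

    const HEX = "0123456789abcdef".split("");

    // Scan one PartitionKey range, still filtering for the missing
    // property on the client (Table Storage can't filter on absence).
    async function scanRange(from: string, to?: string) {
      const filter = to
        ? odata`PartitionKey ge ${from} and PartitionKey lt ${to}`
        : odata`PartitionKey ge ${from}`;
      const results: Record<string, unknown>[] = [];
      for await (const e of client.listEntities({ queryOptions: { filter } })) {
        if (e.PropertyA == null) results.push(e);
      }
      return results;
    }

    // Run all 16 range scans concurrently and join the results in memory.
    async function scanAllInParallel() {
      const chunks = await Promise.all(
        HEX.map((ch, i) => scanRange(ch, HEX[i + 1]))
      );
      return chunks.flat();
    }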
