Azure Table Storage multi-row query performance

Question

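We're having a problem in a service that uses Azure Table Storage: sometimes queries take multiple seconds (3 to 30 seconds). This happens daily, but only for some queries. We don't have a huge load on the service or on table storage (something like a few hundred calls per hour). Still, table storage is not performing.

The slow queries are all filter queries that should return 10 rows at most. I have structured the filters so that there is always a partition key and row key pair, followed by the next partition key and row key pair after an or operator: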

(partitionKey1 and RowKey1) or (partitionKey2 and rowKey2) or (partitionKey3 and rowKey3)
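
For illustration, a filter of that shape could be assembled from a list of key pairs roughly as follows. This is a minimal sketch, not the code from the service: the helper names `point_filter` and `combined_filter` are made up for this example, and the only OData detail assumed is that a single quote inside a string literal is escaped by doubling it.

```python
def point_filter(partition_key, row_key):
    """Build an OData filter that matches one entity by its two keys."""
    # Single quotes inside an OData string literal are escaped by doubling them.
    pk = partition_key.replace("'", "''")
    rk = row_key.replace("'", "''")
    return "(PartitionKey eq '{}' and RowKey eq '{}')".format(pk, rk)


def combined_filter(keys):
    """Join one point filter per (partition_key, row_key) pair with 'or'."""
    return " or ".join(point_filter(pk, rk) for pk, rk in keys)


keys = [("p1", "r1"), ("p2", "r2")]
print(combined_filter(keys))
# (PartitionKey eq 'p1' and RowKey eq 'r1') or (PartitionKey eq 'p2' and RowKey eq 'r2')
```

Keeping `point_filter` separate means the same key list can drive either one combined query or one query per pair, which is the comparison made in the rest of this question.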

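So currently my working assumption is that I need to split the query into separate queries. I have verified this to some extent with a Python script. When I repeat the same lookup both as a single combined query (joined with or's and expecting multiple rows as the result) and as multiple queries executed separately, I see that the combined query slows down every now and then.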

import time
import threading
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

############################################################################
# Script for querying data from azure table storage or cosmos DB table API.
# SAS token needs to be generated for using this script and a table with data 
# needs to exist.
#
# Warning: extensive use of this script may burden the table performance, 
#          so use with care.
#
# PIP requirements:
#  - requires azure-cosmosdb-table to be installed
#     * run: 'pip install azure-cosmosdb-table'

dateTimeSince = '2019-06-12T13:16:45.446Z'

sasToken = 'SAS_TOKEN_HERE' 
tableName = 'TABLE_NAME_HERE'

table_service = TableService(account_name="ACCOUNT_NAME_HERE", sas_token=sasToken)

tableFilter = "(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0') and (RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34') and (RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc') and (RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56') and (RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a') and (RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')"

resultDict = {}

# Do separate queries

filters = tableFilter.split(" or ")
threads = []

def runQueryPrintResult(filter):
    result = table_service.query_entities(table_name=tableName, filter=filter)
    item = result.items[0]
    resultDict[item.RowKey] = item

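# Loop where:
# - Step 1: the tableFilter query is split into separate queries, one per thread
#      * each query returns a single row
# - Step 2: the tableFilter query is run as a single combined query
# - Press enter to repeat the two query tests
while True:
    # Reset per-iteration state so threads and results from earlier runs are not reused
    threads = []
    resultDict.clear()
    start2 = time.time()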
    for filter in filters:
        x = threading.Thread(target=runQueryPrintResult, args=(filter,))
        x.start()
        threads.append(x)

    for x in threads:
        x.join()

    end2 = time.time()
    print("Time elapsed with multi threaded implementation: {}".format(end2-start2))

    # Do single query
    start1 = time.time()
    listGenerator = table_service.query_entities(table_name=tableName, filter=tableFilter)
    end1 = time.time()
    print("Time elapsed with single query: {}".format(end1-start1))

    counter = 0
    allVerified = True
    for item in listGenerator:
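        # Membership test: indexing a missing key would raise KeyError instead of reaching the else branch
        if item.RowKey in resultDict: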
            counter += 1
        else:
            allVerified = False

    if len(listGenerator.items) != len(resultDict):
        allVerified = False

    print("table item count since x: " + str(counter))

    if allVerified:
        print("Both queries returned same amount of results")
    else:
        print("Result count does not match, single threaded count={}, multithreaded count={}".format(len(listGenerator.items), len(resultDict)))

    input('Press enter to retry test!')

Here is example output from the Python code:

Time elapsed with multi threaded implementation: 0.10776209831237793
Time elapsed with single query: 0.2323908805847168
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.0897986888885498
Time elapsed with single query: 0.21547174453735352
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.08280491828918457
Time elapsed with single query: 3.2932426929473877
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07794523239135742
Time elapsed with single query: 1.4898555278778076
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07962584495544434
Time elapsed with single query: 0.20011520385742188
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
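
Since the slow runs only show up occasionally (3.29 s and 1.49 s in the output above), it may help to collect many timings and look at the spread instead of reading individual lines. A minimal sketch of such a helper follows; `time_calls` is a name made up for this example, and the lambda is only a stand-in workload that would be replaced by a call that runs and fully iterates a query.

```python
import statistics
import time


def time_calls(fn, repeats=20):
    """Run fn() `repeats` times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Stand-in workload; replace with a function that executes one query variant.
summary = time_calls(lambda: sum(range(1000)), repeats=5)
print(summary)
```

Comparing the median against the max for each query variant makes occasional outliers visible without them distorting the typical figure.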

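The service we're having problems with is implemented in C#, though, and I have yet to reproduce on the C# side the results I got with the Python script. There, splitting the query into multiple separate queries seems to perform worse than using a single filter query (returning all the required rows).

So running the following multiple times and waiting for all of them to complete seems to be slower: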

TableOperation getOperation =
    TableOperation.Retrieve<HqrScreenshotItemTableEntity>(partitionKey, id.ToString());
TableResult result = await table.ExecuteAsync(getOperation);

than doing it all in a single query:

        private IEnumerable<MyTableEntity> GetBatchedItemsTableResult(Guid[] ids, string applicationLink)
        {
            var table = InitializeTableStorage();

            TableQuery<MyTableEntity> itemsQuery= 
                new TableQuery<MyTableEntity>().Where(TableQueryConstructor(ids, applicationLink));

            IEnumerable<MyTableEntity> result = table.ExecuteQuery(itemsQuery);

            return result;
        }

        public string TableQueryConstructor(Guid[] ids, string applicationLink)
        {
            var fullQuery = new StringBuilder();

            foreach (var id in ids)
            {
                // Encode the link before using it as the partition key, as REST GET requests
                // do not accept non-encoded URL params by default
                string partitionKey = HttpUtility.UrlEncode(applicationLink);

                // Create query for single row in a requested partition
                string queryForRow = TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
                    TableOperators.And,
                    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, id.ToString()));

                if (fullQuery.Length == 0)
                {
                    // Append query for first row

                    fullQuery.Append(queryForRow);
                }
                else
                {
                    // Append query for subsequent rows with or operator to make queries independent of each other.

                    fullQuery.Append($" {TableOperators.Or} ");
                    fullQuery.Append(queryForRow);
                }
            }

            return fullQuery.ToString();
        }

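The test case I used with the C# code is quite different from the Python test, though. In C# I'm querying 2000 rows from something like 100000 rows of data. If the data is queried in batches of 50 rows, the latter filter query beats single-row queries run in 50 tasks.

Maybe I should just repeat the test I did with Python in C# as a console app, to see whether the .NET client API behaves the same way as the Python one performance-wise.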
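
To make the batching concrete: 2000 ids in batches of 50 means 40 batches, each of which becomes either one combined filter query or 50 single-row lookups. A minimal sketch of the splitting step, with `chunks` as a helper name made up for this example and plain integers standing in for the real GUIDs:

```python
def chunks(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(2000))  # stand-in for the 2000 row ids
batches = list(chunks(ids, 50))
print(len(batches), len(batches[0]))  # 40 50
```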

Answer 1

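I think you should use the multi-threaded implementation, since it consists of multiple Point Queries. Doing it all in a single query probably results in a Table Scan. As the official documentation says:

Using an "or" to specify a filter based on RowKey values results in a partition scan and is not treated as a range query. Therefore, you should avoid queries that use filters such as: $filter=PartitionKey eq 'Sales' and (RowKey eq '121' or RowKey eq '322')

You might think the example above is two Point Queries, but it actually results in a Partition Scan.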
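
As a rough illustration of the point-query approach, the lookups can be issued concurrently from a thread pool. This is only a sketch under assumptions: `fetch_points` is a name made up here, `service` is assumed to expose `get_entity(table_name, partition_key, row_key)` the way `TableService` from azure-cosmosdb-table does, and `InMemoryService` is a hypothetical stand-in so the snippet runs without a storage account.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_points(service, table_name, keys, max_workers=8):
    """Fetch one entity per (partition_key, row_key) pair using concurrent point lookups."""
    def fetch(key):
        partition_key, row_key = key
        return service.get_entity(table_name, partition_key, row_key)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as `keys`;
        # an exception raised by any lookup propagates to the caller here.
        return list(pool.map(fetch, keys))


class InMemoryService:
    """Hypothetical stand-in for a table client, for trying the helper offline."""

    def __init__(self, rows):
        self._rows = rows

    def get_entity(self, table_name, partition_key, row_key):
        return self._rows[(partition_key, row_key)]


service = InMemoryService({
    ("p1", "r1"): {"RowKey": "r1"},
    ("p2", "r2"): {"RowKey": "r2"},
})
entities = fetch_points(service, "TABLE_NAME_HERE", [("p2", "r2"), ("p1", "r1")])
print([e["RowKey"] for e in entities])  # ['r2', 'r1']
```

With the real client, the `table_service` object from the question's script would be passed as `service`. How many workers are worthwhile depends on the workload and is not something this sketch tries to answer.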

Answer 2

Posting this as an answer because it was getting too long for a comment.

Can you try changing your query to the following:

(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0' and RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34' and RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc' and RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56' and RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a' and RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')

Answer 3

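To me, the answer here seems to be that queries executed on table storage are not optimized to work with the OR operator the way you would expect. A query is not handled as a point query when it combines point queries with the OR operator.

This can be reproduced in Python, C# and Azure Storage Explorer: in all of them, combining point queries with OR can be 10x slower (or even more) than running separate point queries that each return a single row.

So the most efficient way to get a number of rows whose partition and row keys are known is to do them all as separate async queries with TableOperation.Retrieve (in C#). Using TableQuery is highly inefficient and does not come anywhere near what the performance scalability targets for Azure Table Storage would lead you to expect. The scalability targets say, for example: "Target throughput for a single table partition (1 KiB entities): up to 2,000 entities per second". Here I was not even served 5 rows per second, even though all the rows were in different partitions.

This limitation in query performance is not stated very clearly in any documentation or performance optimization guide, but it can be understood from the following lines in the Azure Storage performance checklist: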

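Querying

This section describes proven practices for querying the table service.

Query scope

There are several ways to specify the range of entities to query. The following is a discussion of the uses of each.

In general, avoid scans (queries larger than a single entity), but if you must scan, try to organize your data so that your scans retrieve the data you need without scanning or returning significant amounts of entities you don't need.

Point queries

A point query retrieves exactly one entity. It does this by specifying both the partition key and row key of the entity to retrieve. These queries are efficient, and you should use them wherever possible.

Partition queries

A partition query is a query that retrieves a set of data that shares a common partition key. Typically, the query specifies a range of row key values or a range of values for some entity property in addition to a partition key. These are less efficient than point queries, and should be used sparingly.

Table queries

A table query is a query that retrieves a set of entities that does not share a common partition key. These queries are not efficient and you should avoid them if possible.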

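So "a point query retrieves exactly one entity" and "use point queries wherever possible". Since I had split my data into multiple partitions, the combined query may have been handled as a table query: "A table query is a query that retrieves a set of entities that does not share a common partition key". This is despite the query being a combination of point queries that lists both the partition key and the row key for every expected entity. But since the combined query does not retrieve exactly one entity, it cannot be expected to perform as a point query (or as a set of point queries).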

Answer 1: score 1, posted 2019-08-07 08:36:14
Answer 2: score 0, posted 2019-08-06 15:26:24
Answer 3: score 0, accepted, posted 2019-08-09 06:42:52