
Iterating through all items in a DynamoDB table

I'm trying to iterate through all items in my DynamoDB table. (I understand this is an inefficient process, but am doing this one time to build an index table.)

I understand that DynamoDB's scan() function returns the lesser of 1MB or a supplied limit. To compensate for this, I wrote a function that looks for the "LastEvaluatedKey" result and re-queries starting from the LastEvaluatedKey to get all the results.

Unfortunately, it seems like every time my function loops, every single key in the entire database is scanned, quickly eating up my allocated read units. It's extremely slow.

Here is my code:

def search(self, table, scan_filter=None, range_key=None,
           attributes_to_get=None,
           limit=None):
    """ Scan a table for values and return the matching
        items as a list.
    """

    start_key = None
    num_results = 0
    total_results = []
    loop_iterations = 0
    request_limit = limit

    while num_results < limit:
        results = self.conn.layer1.scan(table_name=table,
                                  attributes_to_get=attributes_to_get,
                                  exclusive_start_key=start_key,
                                  limit=request_limit)
        num_results = num_results + len(results['Items'])
        # LastEvaluatedKey is absent from the last page of a scan.
        start_key = results.get('LastEvaluatedKey')
        total_results = total_results + results['Items']
        loop_iterations = loop_iterations + 1
        request_limit = request_limit - results['Count']

        print("Count: " + str(results['Count']))
        print("Scanned Count: " + str(results['ScannedCount']))
        print("Last Evaluated Key: " + str(start_key['HashKeyElement']['S'] if start_key else None))
        print("Capacity: " + str(results['ConsumedCapacityUnits']))
        print("Loop Iterations: " + str(loop_iterations))

        if start_key is None:  # reached the end of the table
            break

    return total_results

Calling the function:

db = DB()
results = db.search(table='media',limit=500,attributes_to_get=['id'])

And my output:

Count: 96
Scanned Count: 96
Last Evaluated Key: kBR23QJNAwYZZxF4E3N1crQuaTwjIeFfjIv8NyimI9o
Capacity: 517.5
Loop Iterations: 1
Count: 109
Scanned Count: 109
Last Evaluated Key: ATcJFKfY62NIjTYY24Z95Bd7xgeA1PLXAw3gH0KvUjY
Capacity: 516.5
Loop Iterations: 2
Count: 104
Scanned Count: 104
Last Evaluated Key: Lm3nHyW1KMXtMXNtOSpAi654DSpdwV7dnzezAxApAJg
Capacity: 516.0
Loop Iterations: 3
Count: 104
Scanned Count: 104
Last Evaluated Key: iirRBTPv9xDcqUVOAbntrmYB0PDRmn5MCDxdA6Nlpds
Capacity: 513.0
Loop Iterations: 4
Count: 100
Scanned Count: 100
Last Evaluated Key: nBUc1LHlPPELGifGuTSqPNfBxF9umymKjCCp7A7XWXY
Capacity: 516.5
Loop Iterations: 5

Is this expected behavior? Or, what am I doing wrong?

Short answer

You are not doing anything wrong.

Long answer

This is closely related to the way Amazon computes the capacity unit. First, it is extremely important to understand that:

  • capacity units == reserved computational units
  • capacity units != reserved network transit

Well, even that is not strictly exact, but it is quite close, especially when it comes to Scan.

During a Scan operation, there is a fundamental distinction between:

  • scanned Items: their cumulated size is at most 1MB, and may be below that size if the limit is reached first
  • returned Items: all the matching items among the scanned Items

As the capacity unit is a compute unit, you pay for the scanned Items. Well, actually, you pay for the cumulated size of the scanned items. Beware that this size includes all the storage and index overhead: 0.5 capacity units per cumulated KB.

The scanned size does not depend on any filter, be it a field selector or a result filter.
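To make that concrete, here is a minimal sketch (not from the original post) reusing the boto layer1 API the question's code calls; the 'type' attribute and its 'video' value are hypothetical, and the scan_filter dict is assumed to follow the raw wire format of the original DynamoDB API. A filtered and an unfiltered Scan over the same data should report roughly the same ScannedCount and ConsumedCapacityUnits; only Count differs:

import boto

conn = boto.connect_dynamodb()  # credentials read from the environment

unfiltered = conn.layer1.scan(table_name='media')

filtered = conn.layer1.scan(
    table_name='media',
    # Wire-format filter (hypothetical attribute): keep only items
    # whose string attribute 'type' equals 'video'.
    scan_filter={'type': {'AttributeValueList': [{'S': 'video'}],
                          'ComparisonOperator': 'EQ'}})

for label, r in (('unfiltered', unfiltered), ('filtered', filtered)):
    print(label + " -> Count: " + str(r['Count'])
          + ", Scanned Count: " + str(r['ScannedCount'])
          + ", Capacity: " + str(r['ConsumedCapacityUnits']))
# Expected: near-identical ScannedCount and capacity for both calls,
# but a smaller Count for the filtered scan.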

From your results, I guess that your Items require ~10KB each, which your comment on their actual payload size tends to confirm.
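A quick back-of-the-envelope check of that guess (plain arithmetic, no AWS call; the 10KB figure is the estimate above):

def estimated_scan_cost(items_scanned, avg_item_kb):
    # 0.5 capacity units per cumulated KB scanned, per the rule above.
    # avg_item_kb should include the storage and index overhead.
    return 0.5 * items_scanned * avg_item_kb

print(estimated_scan_cost(104, 10))  # 520.0, close to the observed 516.0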

Another example

I have a test table which contains only very small elements. A Scan consumes only 1.0 capacity unit to retrieve 100 Items because the cumulated size is below 2KB.
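The same back-of-the-envelope model accounts for this: roughly 2KB of cumulated data costs 0.5 * 2 = 1.0 capacity unit, no matter how many tiny items that 2KB contains.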
