Scan vs Parallel Scan in AWS DynamoDB?

Question

AWS is in high demand for cloud storage, and I need the scan process to be faster. How does the scan process work, and which one (Scan or Parallel Scan) is better in which situation?

How does Scan work in AWS DynamoDB?
How does Parallel Scan work in AWS DynamoDB?
How do Scan and Parallel Scan compare in AWS DynamoDB?
When should Parallel Scan be preferred?
Is a filter expression applied before the scan?

Answer 1

1. How does Scan work in AWS DynamoDB?

Ans:

i) A Scan operation reads every item in a table or secondary index and returns one or more items.

ii) By default, a Scan operation proceeds sequentially.

iii) By default, Scan uses eventually consistent reads when accessing the data in a table.

iv) If the total size of the scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and the results are returned to the user together with a LastEvaluatedKey value, which is used to continue the scan in a subsequent operation.

v) Because a Scan performs eventually consistent reads by default and can return up to 1 MB (one page) of data, a single Scan request can consume

(1 MB page size / 4 KB item size) / 2 (eventually consistent reads) = 128 read operations.
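The paging behaviour in iv) and the arithmetic in v) can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes `client` is a boto3-style low-level DynamoDB client (for example `boto3.client("dynamodb")`), and the helper names `read_units_for_page` and `scan_all` are made up for this example.

```python
import math

PAGE_LIMIT_BYTES = 1024 * 1024  # a Scan page holds at most 1 MB
READ_UNIT_BYTES = 4 * 1024      # one read covers up to 4 KB


def read_units_for_page(page_bytes=PAGE_LIMIT_BYTES, consistent=False):
    """Read operations consumed by one Scan page of the given size."""
    units = math.ceil(page_bytes / READ_UNIT_BYTES)
    # Eventually consistent reads (the Scan default) cost half as much.
    return units if consistent else units / 2


def scan_all(client, table_name):
    """Yield every item, following LastEvaluatedKey from page to page."""
    kwargs = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break  # no key returned: the scan reached the end of the table
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

With the defaults, `read_units_for_page()` reproduces the 128 read operations computed above, and passing `consistent=True` doubles it.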

2. How does Parallel Scan work in AWS DynamoDB?

Ans:

i) For faster performance on a large table or secondary index, applications can request a parallel Scan operation.

ii) You can run multiple worker threads or processes in parallel. Each worker scans a separate segment of the table concurrently with the other workers. For this, the Scan operation accepts two additional parameters:

TotalSegments denotes the number of segments the table is divided into, which is normally the number of workers that will access the table concurrently.
Segment denotes the segment of the table to be accessed by the calling worker. It is zero-based, so valid values run from 0 to TotalSegments - 1.

iii) The two parameters, when used together, limit the scan to a particular block of items in the table. You can also use the existing Limit parameter to control how much data is returned by an individual Scan request.
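As a rough sketch of how the workers described above might be wired up, the following runs one thread per segment and merges the results. It assumes `client` is a boto3-style low-level DynamoDB client shared between threads; the function names `scan_segment` and `parallel_scan` are hypothetical, and the thread-per-segment layout is one possible design rather than the only one.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Scan a single segment to completion and return its items."""
    kwargs = {
        "TableName": table_name,
        "Segment": segment,               # zero-based segment for this worker
        "TotalSegments": total_segments,  # must be the same for every worker
    }
    items = []
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        if "LastEvaluatedKey" not in page:
            return items
        # Each segment keeps its own paging position.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def parallel_scan(client, table_name, total_segments):
    """Run one worker thread per segment and concatenate what they return."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, seg, total_segments)
            for seg in range(total_segments)
        ]
        return [item for future in futures for item in future.result()]
```

Items come back grouped by segment rather than in any table-wide order, so callers that need an ordering have to sort afterwards.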

3. Scan vs Parallel Scan in AWS DynamoDB?

Ans:

i) A sequential Scan can only read one partition at a time, so a parallel scan is needed to read from multiple partitions at once.

ii) A sequential Scan might not always be able to fully utilize the provisioned read throughput capacity; a parallel scan can use more of it.

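iii) A parallel scan does not make the read itself cheaper: it reads the same data as a sequential scan and so consumes the same read capacity, only in less elapsed time. The "up to 4x" cost reduction for certain types of queries and scans that AWS announced together with parallel scans came from a separate change to how read capacity is metered.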

4. When should Parallel Scan be preferred?

Ans:

A parallel scan can be the right choice if the following conditions are met:

The table size is 20 GB or larger.
The table's provisioned read throughput is not being fully utilized.
Sequential Scan operations are too slow.
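The three conditions can be written down as a simple check. This is only a sketch: the function name and its boolean inputs are assumptions made for illustration, and "20 GB" is interpreted here as 20 × 1024³ bytes. The table size itself could be read from the `TableSizeBytes` field of a DescribeTable response.

```python
TWENTY_GB = 20 * 1024 ** 3  # assumption: GB taken as 1024^3 bytes


def parallel_scan_candidate(table_size_bytes, throughput_has_headroom,
                            sequential_too_slow):
    """True only when all three conditions listed above hold.

    throughput_has_headroom and sequential_too_slow are judgement calls
    supplied by the caller; no threshold is assumed for them here.
    """
    return (
        table_size_bytes >= TWENTY_GB
        and throughput_has_headroom
        and sequential_too_slow
    )
```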

5. Is a filter expression applied before the scan?

Ans: No. A FilterExpression is applied after the items have already been read, so the scan consumes the same read capacity whether or not a filter is present; the filtering itself does not consume any additional read capacity units.
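One way to see this is to compare the `Count` and `ScannedCount` fields that every Scan response carries: `ScannedCount` is the number of items read before filtering and `Count` the number left afterwards. The sketch below assumes a boto3-style low-level client and a hypothetical string attribute named `status`; the function name is also made up for the example.

```python
def filtered_scan_counts(client, table_name, status):
    """Return (matched, examined) totals for a filtered scan."""
    kwargs = {
        "TableName": table_name,
        "Select": "COUNT",  # counts only, no item data returned
        "FilterExpression": "#s = :s",
        # "status" is a hypothetical attribute; the #s placeholder avoids
        # clashing with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":s": {"S": status}},
    }
    matched = examined = 0
    while True:
        page = client.scan(**kwargs)
        matched += page["Count"]          # items that passed the filter
        examined += page["ScannedCount"]  # items read before filtering
        if "LastEvaluatedKey" not in page:
            return matched, examined
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

If `examined` is much larger than `matched`, most of the read capacity went to items the filter then discarded.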

Answer 2

Addressing the question of when a Parallel Scan should be used over a regular Scan...

My experience is that a parallel scan is faster than a regular scan once you get above 2MB of data in a table and, roughly, performance seems to be best when running one segment per 1MB of data in the table.
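That rule of thumb can be expressed as a small helper. This is just the heuristic from this answer turned into code, not an AWS recommendation; the function name and the `mb_per_segment` parameter are invented for the example.

```python
import math

BYTES_PER_MB = 1024 * 1024


def suggested_segments(table_size_bytes, mb_per_segment=1):
    """Roughly one segment per MB of table data, and never fewer than one."""
    return max(1, math.ceil(table_size_bytes / (mb_per_segment * BYTES_PER_MB)))
```

For a 70KB table this gives 1 segment (a regular scan) and for a 4MB table it gives 4. For very large tables the number is best treated as a starting point to measure against, since the number of segments that is practical also depends on how many workers the client can run.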

I have three tables, each using on-demand capacity: a tiny table containing 300 items and 70KB of data, a small table containing 1,800 items and 4MB of data, and a large table containing 1.1 million items and 1.05GB of data.

I can time a regular scan by putting this command into a shell script called scan.sh

aws dynamodb scan --table-name MyTable --select COUNT

And then execute

time ./scan.sh

I can time a parallel scan by replacing the command in the shell script with

aws dynamodb scan --table-name MyTable --total-segments 4 --segment 0 --select COUNT

The above command splits the scan into 4 segments but executes only one of them (segment 0). I use DynamoDBMapper (Java SDK) in my application, and the SDK takes care of running the different segments in parallel threads.

On my tiny table, each scan took 1.4s, and running parallel scans made no difference. On my small table, a regular scan took 1.8s and a parallel scan was optimal with 4 segments, running in 1.4s.

The interesting result was the large table. Here is the time to execute the scan, based on the number of segments in the parallel scan:

1 segment - 120 seconds
4 segments - 30 seconds
8 segments - 15 seconds
16 segments - 8 seconds
32 segments - 5 seconds
64 segments - 3 seconds
128 segments - 1.9 seconds
256 segments - 1.6 seconds
512 segments - 1.4 seconds
1024 segments - 1.4 seconds
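To make the diminishing returns in these numbers easier to read, the timings can be turned into a speedup and a per-segment efficiency relative to the single-segment run. The dictionary below simply restates the measurements above; the function name is made up for this sketch.

```python
# Segment count -> measured scan time in seconds (large table, from the list above).
TIMINGS = {
    1: 120.0, 4: 30.0, 8: 15.0, 16: 8.0, 32: 5.0,
    64: 3.0, 128: 1.9, 256: 1.6, 512: 1.4, 1024: 1.4,
}


def speedup_and_efficiency(timings):
    """Map each segment count to (speedup vs 1 segment, speedup per segment)."""
    base = timings[1]
    return {
        segments: (base / seconds, base / seconds / segments)
        for segments, seconds in timings.items()
    }


if __name__ == "__main__":
    for segments, (speedup, efficiency) in speedup_and_efficiency(TIMINGS).items():
        print(f"{segments:5d} segments: {speedup:6.1f}x faster, "
              f"{efficiency:.2f} efficiency per segment")
```

Scaling is linear up to 8 segments (an efficiency of 1.0) and flattens from 512 segments onward. The 1.4 second floor equals the time measured on the tiny table, which may indicate fixed per-invocation overhead rather than scan time.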

Answer 1
16, accepted, 2016-12-21 17:41:17

Answer 2
3, 2020-01-15 18:46:28
