
DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a hash key and a range key:

Hash = date.random_number
Range = timestamp

How do we get items between timestamps X and Y? Since the hash key has a random_number appended, the query has to be fired that many times. Is it possible to supply multiple hash values with a single RangeKeyCondition?

What would be most efficient in terms of cost and time?

The random number ranges from 1 to 10.
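Because the random suffix only ranges from 1 to 10, the application can enumerate every possible hash key for a given day up front. A minimal sketch (the `date.random_number` format and the suffix bound come from the question above; the class and method names are made up for illustration):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class HashKeys {
    // Builds every possible hash key for one day, assuming the
    // "date.random_number" key format with suffixes 1..maxSuffix.
    static List<String> hashKeysForDay(LocalDate day, int maxSuffix) {
        List<String> keys = new ArrayList<>();
        for (int n = 1; n <= maxSuffix; n++) {
            keys.add(day + "." + n); // e.g. "2014-07-09.3"
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = hashKeysForDay(LocalDate.of(2014, 7, 9), 10);
        System.out.println(keys.size());  // 10
        System.out.println(keys.get(0));  // 2014-07-09.1
    }
}
```

Each of these keys would then be used as the hash key of one Query, as discussed below.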

If I understood correctly, you have a table with the following definition of primary keys:

Hash Key  : date.random_number 
Range Key : timestamp

One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the hash key in your application in order to successfully retrieve one or more items from your table.

It makes sense to use random numbers as part of your hash key so your records are evenly distributed across the DynamoDB partitions; however, you have to do it in a way that your application can still calculate those numbers when it needs to retrieve the records.

With that in mind, let's build the query needed for the specified requirements. The native DynamoDB operations available for fetching several items from a table are:

Query, BatchGetItem and Scan
  • In order to use BatchGetItem you would need to know the entire primary key (hash key and range key) of every item beforehand, which is not the case here.

  • The Scan operation will literally go through every record of your table, which is unnecessary for your requirements.

  • Lastly, the Query operation allows you to retrieve one or more items from a table by applying the EQ (equality) operator to the hash key, together with one of several operators on the range key for cases where you don't have the entire range key or would like to match more than one item.

The operator options for the range key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN

It seems to me that the most suitable operator for your requirements is BETWEEN. That said, let's see how you could build the query with the Java SDK's Document API:

Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query(
    "HashKeyAttributeName", hashKey,
    rangeKeyCondition,
    null,  // FilterExpression - not used in this example
    null,  // ProjectionExpression - not used in this example
    null,  // ExpressionAttributeNames - not used in this example
    null); // ExpressionAttributeValues - not used in this example

// Iterate over the matching items (the collection pages through results lazily)
for (Item item : items) {
    System.out.println(item.toJSONPretty());
}

You might want to look at the following post for more information about DynamoDB primary keys: DynamoDB: When to use what PK type?

QUESTION: My concern is querying multiple times because of the random_number attached to the hash key. Is there a way to combine these queries and hit DynamoDB only once?

Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing, to avoid hot partitions and uneven use of your provisioned throughput:

Design For Uniform Data Access Across Items In Your Tables

"Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput. [...] To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload."
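The merge step the documentation mentions can be modeled client-side. The helper below is only a sketch of that step: each inner list stands for the outcome of one Query against one date.N hash key, with the range key modeled as a plain String timestamp (the actual per-key queries would use the Document API call shown earlier):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeResults {
    // Merge the per-hash-key query results into one list, re-sorted by
    // the range key (timestamp): each Query only returns items in
    // range-key order within its own hash key, so the combined result
    // must be sorted again on the client.
    static List<String> mergeByTimestamp(List<List<String>> perKeyResults) {
        List<String> merged = new ArrayList<>();
        for (List<String> result : perKeyResults) {
            merged.addAll(result);
        }
        merged.sort(Comparator.naturalOrder());
        return merged;
    }

    public static void main(String[] args) {
        List<String> merged = mergeByTimestamp(List.of(
            List.of("2014-07-09T10:00", "2014-07-09T12:00"),
            List.of("2014-07-09T11:00")));
        System.out.println(merged); // [2014-07-09T10:00, 2014-07-09T11:00, 2014-07-09T12:00]
    }
}
```

Since the per-key queries are independent, they could also be issued in parallel to reduce overall latency, at the cost of a burst of read capacity.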

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Here is another interesting point suggesting moderate use of reads in a single partition... if you remove the random number from the hash key in order to get all records in one shot, you are likely to run into this issue regardless of whether you use Scan, Query or BatchGetItem:

Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity

"Note that it is not just the burst of capacity units the Scan uses that is a problem. It is also because the scan is likely to consume all of its capacity units from the same partition because the scan requests read items that are next to each other on the partition. This means that the request is hitting the same partition, causing all of its capacity units to be consumed, and throttling other requests to that partition. If the request to read data had been spread across multiple partitions, then the operation would not have throttled a specific partition."

And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally back these tables up to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
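As a rough sketch of this table-per-period idea, the writer could derive the target table name from each item's date, so that older tables can have their throughput dialed down or be deleted wholesale (the base name and the yyyy_MM naming scheme are hypothetical, not from the documentation):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TimeSeriesTables {
    // Route each item to a monthly table; all writes for a given month
    // land in the same table, which can later be archived or dropped.
    static String tableForMonth(String baseName, LocalDate date) {
        return baseName + "_" + date.format(DateTimeFormatter.ofPattern("yyyy_MM"));
    }

    public static void main(String[] args) {
        System.out.println(tableForMonth("events", LocalDate.of(2014, 7, 9))); // events_2014_07
    }
}
```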

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
