
Random sampling of size N in DynamoDB without a full table scan

Asked 2021-08-18 05:22:12 · Tags: amazon-dynamodb, dynamodb-queries, amazon-dynamodb-streams, amazon-dynamodb-dax

I am new to DynamoDB and am having trouble finding a way to randomly get items without a full table scan. Most of the algorithms I found rely on full table scans. I am also considering the case where we have no additional information about the table (the columns and column types are unknown). Is there a way to do this?

2 answers

You can randomly sample by using a randomly generated exclusive start key for the scan or query operation. The exclusive start key does not have to match a record in the table. It just needs to follow the key structure of the table/index.
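A minimal sketch of this idea in Python. It assumes the table has a single string partition key named `PK`; the attribute name and type are assumptions made for illustration (when the schema is unknown they would have to be looked up first, for example with DescribeTable). The `client` argument is expected to behave like `boto3.client("dynamodb")`, and the function name `random_scan_sample` is hypothetical.

```python
import uuid


def random_scan_sample(client, table_name, n, key_name="PK"):
    """Return up to n items, starting the scan after a random key.

    Assumes a single string partition key called key_name
    (an assumption of this sketch, not something the answer specifies).
    """
    # The start key does not have to exist in the table; it only has to
    # match the key schema (attribute names and types).
    start_key = {key_name: {"S": str(uuid.uuid4())}}
    response = client.scan(
        TableName=table_name,
        ExclusiveStartKey=start_key,
        Limit=n,
    )
    return response.get("Items", [])
```

If the scan reaches the end of the table it may return fewer than `n` items; a caller could then scan again from the beginning to fill the remainder.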

As with most questions about queries in DynamoDB, how you structure your data depends on how you want to query it.

For something like random sampling, you have to make it conform to the following core constraints of DynamoDB:

You have to provide a partition key
You can provide a sort key

So with a "single table" type design, you could structure your data something like this:

PK	SK	myVal
my_dict	6caaf1e3-eb8d-404a-a2ae-97d6682b0224	foo
my_dict	1c5496e8-c660-4b4e-980f-4abfb1942863	bar
my_dict	56551340-fff8-4824-a5be-70fcaece2e1a	baz
my_other_dict	520a7b37-233c-49dd-87da-77d871d98c92	test1
my_other_dict	65ccd54e-72c3-499d-a3a7-0cd989252607	test2

The PK is the identifier for your collection of random things to look up. The SK is a random UUID. And myVal contains the value you want to be returned.
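As a rough illustration of that layout, items could be built like this in Python. The helper name `make_item` is hypothetical, and the sketch assumes all three attributes are stored as strings, using DynamoDB's low-level attribute-value format.

```python
import uuid


def make_item(collection, value):
    """Build one item in DynamoDB's low-level attribute-value format."""
    return {
        "PK": {"S": collection},         # which collection the item belongs to
        "SK": {"S": str(uuid.uuid4())},  # random UUID used as the sort key
        "myVal": {"S": value},           # the value to be returned
    }


items = [make_item("my_dict", v) for v in ("foo", "bar", "baz")]
```

Each item could then be written with something like `client.put_item(TableName="my-table", Item=item)`, where `client` is a boto3 DynamoDB client.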

You can query this table the following way:

SELECT * FROM "my-table" WHERE PK = 'my_dict' AND SK < '06a04e20-b239-48f2-a205-552eb61fef35'

By querying with a UUID as the SK bound, you'll get the first item in the table with a UUID close to the one you query for. By using a random UUID each time you query, you'll get a random result back.

The particular query above actually returns nothing, because no SK under my_dict sorts below 06a04e20-b239-48f2-a205-552eb61fef35, so you need to retry until you get a result.
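The query-and-retry loop could be sketched in Python roughly as follows. This is only one possible reading of the approach: it uses the Query API rather than PartiQL, and it adds `ScanIndexForward=False` with `Limit=1` so that a single item (the closest one below the random UUID) comes back. The function name `random_item` and the `max_attempts` parameter are hypothetical; `client` is expected to behave like `boto3.client("dynamodb")`.

```python
import uuid


def random_item(client, table_name, pk, max_attempts=20):
    """Return one item from the pk collection, or None if none was found."""
    for _ in range(max_attempts):
        probe = str(uuid.uuid4())  # fresh random UUID for every attempt
        response = client.query(
            TableName=table_name,
            KeyConditionExpression="PK = :pk AND SK < :sk",
            ExpressionAttributeValues={
                ":pk": {"S": pk},
                ":sk": {"S": probe},
            },
            ScanIndexForward=False,  # descending: closest key below the probe first
            Limit=1,
        )
        items = response.get("Items", [])
        if items:
            return items[0]
        # Nothing sorts below this probe; draw a new one and try again.
    return None
```

Capping the number of attempts keeps the loop from spinning forever on an empty collection.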

Also, I haven't done the math (who has?), but I'd imagine that repeated queries like this won't generate perfectly uniform distributions, especially for small data sets.
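A rough way to check that intuition is sketched below. It assumes each lookup returns the highest SK below the random UUID, and it treats the random UUID as uniform over the full 128-bit space (only approximately true for version 4 UUIDs, which have a few fixed bits). Under those assumptions, an item's chance of being picked is the size of the gap between its key and the next higher key, so unevenly spaced keys are picked unevenly. The function name `selection_probabilities` is hypothetical.

```python
import uuid

SPACE = 2 ** 128  # size of the 128-bit UUID space

# The three sort keys stored under my_dict in the example table.
keys = sorted(
    uuid.UUID(s).int
    for s in (
        "6caaf1e3-eb8d-404a-a2ae-97d6682b0224",
        "1c5496e8-c660-4b4e-980f-4abfb1942863",
        "56551340-fff8-4824-a5be-70fcaece2e1a",
    )
)


def selection_probabilities(sorted_keys):
    """Chance that each key is returned by a single lookup.

    A key is returned when the probe lands between it and the next
    higher key (or the end of the key space for the last key).
    """
    upper_bounds = sorted_keys[1:] + [SPACE]
    return [(hi - lo) / SPACE for lo, hi in zip(sorted_keys, upper_bounds)]


probs = selection_probabilities(keys)
miss = keys[0] / SPACE  # probe sorts below every key: empty result, retry

for key, p in zip(keys, probs):
    print(uuid.UUID(int=key), round(p, 3))
print("no result", round(miss, 3))
```

For these three keys the spread is large: the item with the highest key is returned on more than half of all lookups, while the middle one is returned on fewer than one in ten.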