Odd behavior with begins_with and a binary column in DynamoDB

Summary

When querying a binary range key with begins_with, some results are not returned even though they begin with the queried value. This appears to happen only with certain values, and only in DynamoDB-local, not in the AWS-hosted version of DynamoDB.

Here is a gist you can run that reproduces the issue: https://gist.github.com/pbaughman/922db7b51f7f82bbd9634949d71f846b

Details

I have a DynamoDB table with the following schema:

user_id - Partition Key - binary - 16 bytes - contains a UUID
project_id_item_id - Sort Key - binary - 32 bytes - two UUIDs concatenated
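
As a rough illustration of that key layout, the sketch below builds the 32-byte sort key from two UUIDs and splits it back apart. The helper names `make_sort_key` and `split_sort_key` are hypothetical, introduced only for this example, and the item UUID is randomly generated.

```python
import uuid
from typing import Tuple


def make_sort_key(project_id: uuid.UUID, item_id: uuid.UUID) -> bytes:
    """Concatenate two 16-byte UUIDs into one 32-byte sort key."""
    return project_id.bytes + item_id.bytes


def split_sort_key(sort_key: bytes) -> Tuple[uuid.UUID, uuid.UUID]:
    """Recover (project_id, item_id) from a 32-byte sort key."""
    if len(sort_key) != 32:
        raise ValueError("expected a 32-byte sort key")
    return uuid.UUID(bytes=sort_key[:16]), uuid.UUID(bytes=sort_key[16:])


project_id = uuid.UUID("76761923-aeba-4edf-9fcc-b9eeb5f80cc4")
item_id = uuid.uuid4()  # random item id, for illustration only
sort_key = make_sort_key(project_id, item_id)

print(len(sort_key))                          # 32
print(sort_key.startswith(project_id.bytes))  # True
```

Because the project UUID occupies the first 16 bytes, a prefix match on `project_id.bytes` should select every item in that project.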

While running my unit tests locally using the dynamodb-local Docker image, I have observed some bizarre behavior.

I've inserted 20 items into my table like this:

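# user_id, project_id and item_id are uuid.UUID objects
table.put_item(
    Item={
        'user_id': user_id.bytes,                                # 16-byte partition key
        'project_id_item_id': project_id.bytes + item_id.bytes,  # 32-byte sort key
    }
)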

Each item has the same user_id and the same project_id, with a different item_id.

When I attempt to query the same data back out, sometimes (maybe 1 in 5 times that I run the test) I only get some of the items back:

from boto3.dynamodb.conditions import Key

table.query(
    KeyConditionExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# Only returns 14 items

If I drop the 2nd condition from the KeyConditionExpression, I get all 20 items.

If I run a scan instead of a query and use the same condition expression, I get all 20 items:

table.scan(
    FilterExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# 20 items are returned
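
Since both the query without the sort-key condition and the scan return all 20 items, one stopgap for local tests is to fetch the whole partition and apply the prefix match in Python. The sketch below is a minimal version of that idea, not a fix for the underlying behavior. The function name `filter_by_prefix` is hypothetical, and it assumes each value is either plain `bytes` or an object exposing the raw bytes on a `.value` attribute, as in the `.value.hex()` call used in this question.

```python
from typing import Any, Dict, Iterable, List


def filter_by_prefix(
    items: Iterable[Dict[str, Any]],
    prefix: bytes,
    attr: str = "project_id_item_id",
) -> List[Dict[str, Any]]:
    """Keep only the items whose binary attribute starts with `prefix`."""
    matched = []
    for item in items:
        raw = item[attr]
        # Unwrap objects that carry their bytes on `.value`; pass bytes through.
        data = bytes(getattr(raw, "value", raw))
        if data.startswith(prefix):
            matched.append(item)
    return matched


# Stand-in data; in a real test the items would come from a query on user_id only.
prefix = bytes.fromhex("76761923aeba4edf9fccb9eeb5f80cc4")
items = [
    {"project_id_item_id": prefix + bytes.fromhex("95f85af4d1f142608fae72e23f54cbfb")},
    {"project_id_item_id": bytes(32)},  # 32 zero bytes, does not match
]

print(len(filter_by_prefix(items, prefix)))  # 1
```

This reads every item in the partition and ignores pagination, so it is only reasonable for small test fixtures.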

If I print the project_id_item_id of every item in the table, I can see that they all start with the same project_id:

[i['project_id_item_id'].value.hex() for i in table.scan()['Items']]

# Result:
  |---------Project Id-----------|
['76761923aeba4edf9fccb9eeb5f80cc40604481b26c84c73b63308dd588a4df1',
 '76761923aeba4edf9fccb9eeb5f80cc40ec926452c294c909befa772b86e2175',
 '76761923aeba4edf9fccb9eeb5f80cc460ff943b36ec44518175525d6eb30480',
 '76761923aeba4edf9fccb9eeb5f80cc464e427afe84d49a5b3f890f9d25ee73b',
 '76761923aeba4edf9fccb9eeb5f80cc466f3bfd77b14479a8977d91af1a5fa01',
 '76761923aeba4edf9fccb9eeb5f80cc46cd5b7dec9514714918449f8b49cbe4e',
 '76761923aeba4edf9fccb9eeb5f80cc47d89f44aae584c1c9da475392cb0a085',
 '76761923aeba4edf9fccb9eeb5f80cc495f85af4d1f142608fae72e23f54cbfb',
 '76761923aeba4edf9fccb9eeb5f80cc496374432375a498b937dec3177d95c1a',
 '76761923aeba4edf9fccb9eeb5f80cc49eba93584f964d13b09fdd7866a5e382',
 '76761923aeba4edf9fccb9eeb5f80cc4a6086f1362224115b7376bc5a5ce66b8',
 '76761923aeba4edf9fccb9eeb5f80cc4b5c6872aa1a84994b6f694666288b446',
 '76761923aeba4edf9fccb9eeb5f80cc4be07cd547d804be4973041cfd1529734',
 '76761923aeba4edf9fccb9eeb5f80cc4c48daab011c449f993f061da3746a660',
 '76761923aeba4edf9fccb9eeb5f80cc4d09bc44973654f39b95a91eb3e291c68',
 '76761923aeba4edf9fccb9eeb5f80cc4d0edda3d8c6643ad8e93afe2f1b518d4',
 '76761923aeba4edf9fccb9eeb5f80cc4d8d1f6f4a85e47d78e2d06ec1938ee2a',
 '76761923aeba4edf9fccb9eeb5f80cc4dc7323adfa35423fba15f77facb9a41b',
 '76761923aeba4edf9fccb9eeb5f80cc4f948fb40873b425aa644f220cdcb5d4b',
 '76761923aeba4edf9fccb9eeb5f80cc4fc7f0583f593454d92a8a266a93c6fcd']

As a sanity check, here is the project_id I'm using in my query:

print(project_id)
76761923-aeba-4edf-9fcc-b9eeb5f80cc4  # Matches what's returned by scan above

Finally, the most bizarre part: if I match on fewer bytes of the project ID, I see all 20 items, then zero items, then all 20 items again:

# Partition key condition, using the user_id attribute from the schema above
user_key = Key('user_id').eq(user_id.bytes)
for n in range(1, 17):
    # Query with only the first n bytes of the project ID as the prefix
    short_key = project_id.bytes[:n]
    range_key = Key('project_id_item_id').begins_with(short_key)
    count = table.query(KeyConditionExpression=user_key & range_key)['Count']
    print("If I only query for 0x{:32} I find {} items".format(short_key.hex(), count))

This gives me:

If I only query for 0x76                               I find 20 items
If I only query for 0x7676                             I find 20 items
If I only query for 0x767619                           I find 20 items
If I only query for 0x76761923                         I find 20 items
If I only query for 0x76761923ae                       I find 20 items
If I only query for 0x76761923aeba                     I find 20 items
If I only query for 0x76761923aeba4e                   I find 20 items
If I only query for 0x76761923aeba4edf                 I find 0 items
If I only query for 0x76761923aeba4edf9f               I find 20 items
If I only query for 0x76761923aeba4edf9fcc             I find 0 items
If I only query for 0x76761923aeba4edf9fccb9           I find 20 items
If I only query for 0x76761923aeba4edf9fccb9ee         I find 0 items
If I only query for 0x76761923aeba4edf9fccb9eeb5       I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f8     I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80c   I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80cc4 I find 15 items

I am totally dumbfounded by this pattern. If the range key I'm searching for is 8, 10 or 12 bytes long, I get no matches. If it's 16 bytes long, I get fewer than 20 but more than 0 matches.

Does anybody have any idea what could be going on here? The documentation indicates that the begins_with expression works with Binary data. I'm totally at a loss as to what could be going wrong. I wonder if DynamoDB-local is doing something like converting the binary data to strings internally to do the comparisons, and some of these binary patterns don't convert correctly.
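
To make that guess concrete, here is a small sketch of how a lossy bytes-to-string conversion could break a prefix match. It is purely illustrative: `naive_begins_with` is a hypothetical function, the choice of UTF-8 with replacement characters is an assumption, and nothing here is taken from DynamoDB-local's code.

```python
def naive_begins_with(value: bytes, prefix: bytes) -> bool:
    """Prefix test on lossily decoded text (hypothetical, for illustration).

    Both sides are decoded as UTF-8, with undecodable bytes replaced
    by U+FFFD, before the comparison is made.
    """
    value_text = value.decode("utf-8", errors="replace")
    prefix_text = prefix.decode("utf-8", errors="replace")
    return value_text.startswith(prefix_text)


# First sort key from the scan output above.
key = bytes.fromhex(
    "76761923aeba4edf9fccb9eeb5f80cc40604481b26c84c73b63308dd588a4df1"
)

for n in (7, 8, 9, 10):
    prefix = key[:n]
    print(n, key.startswith(prefix), naive_begins_with(key, prefix))
```

On this key the byte-level check is true for every length, while the string-based check fails at 8 and 10 bytes. Those prefixes end in 0xdf and 0xcc, which UTF-8 treats as the first byte of a two-byte character, so the truncated prefix decodes differently from the same bytes inside the full key. That lines up with two of the zero-result rows above, but it is only suggestive and does not confirm what DynamoDB-local actually does.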

It seems like it might be related to the project_id UUID. If I hard-code it to 76761923-aeba-4edf-9fcc-b9eeb5f80cc4 in the test, I can make it miss items every time.

This may be a six-year-old bug in DynamoDB-local. I will leave this question open in case someone has more insight, and I will update this answer if I'm able to find out more from Amazon.

Edit: As of June 23rd, they have managed to reproduce the issue, and it is in the queue to be fixed in a future release.

2nd Edit: As of August 4th, they are investigating the issue and a fix will be released shortly.
