Odd behavior with begins_with and a binary column in DynamoDB

Summary

When querying a binary range key using begins_with, some results are not returned even though they begin with the value being queried. This appears to happen only with certain values, and only in DynamoDB-local, not in the AWS-hosted version of DynamoDB.

Here is a gist you can run that reproduces the issue: https://gist.github.com/pbaughman/922db7b51f7f82bbd9634949d71f846b

Details

I have a DynamoDB table with the following schema:

user_id - Primary Key - binary - Contains 16 byte UUID
project_id_item_id - Sort Key - binary - 32 bytes - two UUIDs concatenated

While running my unit tests locally using the dynamodb-local docker image, I have observed some bizarre behavior.

I've inserted 20 items into my table like this:

table.put_item(
    Item={
        'user_id': user_id.bytes,
        'project_id_item_id': project_id.bytes + item_id.bytes
    }
)              

Each item has the same user_id and the same project_id, with a different item_id.
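To make the key layout concrete, here is a minimal sketch of how such a 32-byte sort key can be built and split back into its two UUIDs. The helper names make_sort_key and split_sort_key are hypothetical and are not part of the original test code; the sample item UUID is taken from the first row of the scan output shown further down.

```python
import uuid
from typing import Tuple


def make_sort_key(project_id: uuid.UUID, item_id: uuid.UUID) -> bytes:
    """Concatenate two 16-byte UUIDs into one 32-byte sort key (hypothetical helper)."""
    return project_id.bytes + item_id.bytes


def split_sort_key(sort_key: bytes) -> Tuple[uuid.UUID, uuid.UUID]:
    """Split a 32-byte sort key back into (project_id, item_id) (hypothetical helper)."""
    if len(sort_key) != 32:
        raise ValueError("expected a 32-byte sort key")
    return uuid.UUID(bytes=sort_key[:16]), uuid.UUID(bytes=sort_key[16:])


project_id = uuid.UUID("76761923-aeba-4edf-9fcc-b9eeb5f80cc4")
item_id = uuid.UUID("0604481b-26c8-4c73-b633-08dd588a4df1")
sort_key = make_sort_key(project_id, item_id)
print(sort_key.hex())
```

Because the project UUID occupies the first 16 bytes, every item in a project shares the same 16-byte prefix, which is what the begins_with condition relies on.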

When I attempt to query the same data back out, sometimes (maybe 1 in 5 times that I run the test) I only get some of the items back out:

from boto3.dynamodb.conditions import Key

table.query(
    KeyConditionExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# Only returns 14 items

If I drop the 2nd condition from the KeyConditionExpression, I get all 20 items.

If I run a scan instead of a query and use the same condition expression, I get all 20 items:

table.scan(
    FilterExpression=
        Key('user_id').eq(user_id.bytes) &
        Key('project_id_item_id').begins_with(project_id.bytes)
)
# 20 items are returned

If I print the project_id_item_id of every item in the table, I can see that they all start with the same project_id:

[i['project_id_item_id'].value.hex() for i in table.scan()['Items']]

# Result:
  |---------Project Id-----------|
['76761923aeba4edf9fccb9eeb5f80cc40604481b26c84c73b63308dd588a4df1',
 '76761923aeba4edf9fccb9eeb5f80cc40ec926452c294c909befa772b86e2175',
 '76761923aeba4edf9fccb9eeb5f80cc460ff943b36ec44518175525d6eb30480',
 '76761923aeba4edf9fccb9eeb5f80cc464e427afe84d49a5b3f890f9d25ee73b',
 '76761923aeba4edf9fccb9eeb5f80cc466f3bfd77b14479a8977d91af1a5fa01',
 '76761923aeba4edf9fccb9eeb5f80cc46cd5b7dec9514714918449f8b49cbe4e',
 '76761923aeba4edf9fccb9eeb5f80cc47d89f44aae584c1c9da475392cb0a085',
 '76761923aeba4edf9fccb9eeb5f80cc495f85af4d1f142608fae72e23f54cbfb',
 '76761923aeba4edf9fccb9eeb5f80cc496374432375a498b937dec3177d95c1a',
 '76761923aeba4edf9fccb9eeb5f80cc49eba93584f964d13b09fdd7866a5e382',
 '76761923aeba4edf9fccb9eeb5f80cc4a6086f1362224115b7376bc5a5ce66b8',
 '76761923aeba4edf9fccb9eeb5f80cc4b5c6872aa1a84994b6f694666288b446',
 '76761923aeba4edf9fccb9eeb5f80cc4be07cd547d804be4973041cfd1529734',
 '76761923aeba4edf9fccb9eeb5f80cc4c48daab011c449f993f061da3746a660',
 '76761923aeba4edf9fccb9eeb5f80cc4d09bc44973654f39b95a91eb3e291c68',
 '76761923aeba4edf9fccb9eeb5f80cc4d0edda3d8c6643ad8e93afe2f1b518d4',
 '76761923aeba4edf9fccb9eeb5f80cc4d8d1f6f4a85e47d78e2d06ec1938ee2a',
 '76761923aeba4edf9fccb9eeb5f80cc4dc7323adfa35423fba15f77facb9a41b',
 '76761923aeba4edf9fccb9eeb5f80cc4f948fb40873b425aa644f220cdcb5d4b',
 '76761923aeba4edf9fccb9eeb5f80cc4fc7f0583f593454d92a8a266a93c6fcd']

As a sanity check, here is the project_id I'm using in my query:

print(project_id)
76761923-aeba-4edf-9fcc-b9eeb5f80cc4  # Matches what's returned by scan above

Finally, the most bizarre part: if I match only the first n bytes of the project ID, for n from 1 to 16, I see all 20 items for some lengths and zero items for others:

user_key = Key('user_id').eq(user_id.bytes)
for n in range(1, 17):
    short_key = project_id.bytes[:n]
    range_key = Key('project_id_item_id').begins_with(short_key)
    count = table.query(KeyConditionExpression=user_key & range_key)['Count']
    print("If I only query for 0x{:32} I find {} items".format(short_key.hex(), count))

Gets me:

If I only query for 0x76                               I find 20 items
If I only query for 0x7676                             I find 20 items
If I only query for 0x767619                           I find 20 items
If I only query for 0x76761923                         I find 20 items
If I only query for 0x76761923ae                       I find 20 items
If I only query for 0x76761923aeba                     I find 20 items
If I only query for 0x76761923aeba4e                   I find 20 items
If I only query for 0x76761923aeba4edf                 I find 0 items
If I only query for 0x76761923aeba4edf9f               I find 20 items
If I only query for 0x76761923aeba4edf9fcc             I find 0 items
If I only query for 0x76761923aeba4edf9fccb9           I find 20 items
If I only query for 0x76761923aeba4edf9fccb9ee         I find 0 items
If I only query for 0x76761923aeba4edf9fccb9eeb5       I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f8     I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80c   I find 20 items
If I only query for 0x76761923aeba4edf9fccb9eeb5f80cc4 I find 15 items

I am totally dumbfounded by this pattern. If the range key I'm searching for is 8, 10 or 12 bytes long I get no matches. If it's 16 bytes long I get fewer than 20 but more than 0 matches.
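For comparison, here is a minimal sketch of what a plain byte-prefix match gives for the same project ID, run in pure Python against three of the sort keys from the scan output above. It assumes that begins_with on binary data is meant to behave like bytes.startswith; count_matches is a hypothetical helper used only for this illustration.

```python
import uuid

project_id = uuid.UUID("76761923-aeba-4edf-9fcc-b9eeb5f80cc4")

# Three of the hex-encoded sort keys from the scan output above.
sample_keys = [bytes.fromhex(h) for h in (
    "76761923aeba4edf9fccb9eeb5f80cc40604481b26c84c73b63308dd588a4df1",
    "76761923aeba4edf9fccb9eeb5f80cc4a6086f1362224115b7376bc5a5ce66b8",
    "76761923aeba4edf9fccb9eeb5f80cc4fc7f0583f593454d92a8a266a93c6fcd",
)]


def count_matches(keys, prefix):
    """Count keys starting with the given byte prefix (hypothetical helper)."""
    return sum(1 for key in keys if key.startswith(prefix))


# Sweep the prefix length from 1 to 16 bytes, as in the query loop above.
counts = {n: count_matches(sample_keys, project_id.bytes[:n]) for n in range(1, 17)}
for n, count in counts.items():
    print("prefix 0x{:32} matches {} of {} keys".format(
        project_id.bytes[:n].hex(), count, len(sample_keys)))
```

Under this reading, every prefix length matches every sample key, so the zeros at 8, 10 and 12 bytes cannot be explained by the stored data itself.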

Does anybody have any idea what could be going on here? The documentation indicates that the begins_with expression works with Binary data. I'm totally at a loss as to what could be going wrong. I wonder if DynamoDB-local is doing something like converting the binary data to strings internally to do the comparisons and some of these binary patterns don't convert correctly.

It seems like it might be related to the project_id UUID. If I hard-code it to 76761923-aeba-4edf-9fcc-b9eeb5f80cc4 in the test, I can make it miss items every time.

This may be a six-year-old bug in DynamoDB local. I will leave this question open in case someone has more insight, and I will update this answer if I'm able to find out more information from Amazon.

Edit: As of June 23rd, they have managed to reproduce the issue and it is in the queue to be fixed in a future release.

2nd Edit: As of August 4th, they are investigating the issue and a fix will be released shortly.
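Until a fix lands, one possible workaround for local tests is to query on user_id alone (which returned all 20 items above) and apply the prefix match on the client. The sketch below is only an illustration under that assumption: filter_by_prefix and _as_bytes are hypothetical helpers, and the .value handling relies on binary attributes coming back wrapped in an object that exposes the raw bytes as .value, as the scan snippet above does.

```python
from typing import Any, Dict, Iterable, List


def _as_bytes(value: Any) -> bytes:
    """Unwrap a binary attribute: use .value if present, else treat it as raw bytes."""
    return bytes(getattr(value, "value", value))


def filter_by_prefix(items: Iterable[Dict[str, Any]], attr: str, prefix: bytes) -> List[Dict[str, Any]]:
    """Keep only the items whose binary attribute starts with prefix (hypothetical helper)."""
    return [item for item in items if _as_bytes(item[attr]).startswith(prefix)]


# Stand-in items with plain bytes values; real query results would be passed the same way.
items = [
    {"project_id_item_id": bytes.fromhex("76761923aeba4edf") + b"\x01"},
    {"project_id_item_id": bytes.fromhex("00000000aeba4edf") + b"\x02"},
]
matching = filter_by_prefix(items, "project_id_item_id", bytes.fromhex("76761923aeba4edf"))
print(len(matching))
```

In the test above, this would be called on the 'Items' list returned by a query that uses only the user_id condition. Note that a query response can be paginated, so a complete version would also follow LastEvaluatedKey.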
