
Full table scan using boto3 (Python)

I'm trying to do a full scan of my DynamoDB table, which contains more than 2,000,000 records.

Initially, what I did was:

import boto3
import pandas as pd
import json
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('acloudapi_media_url_testing')
response = table.scan(
    FilterExpression=Attr('api_key').eq('xxxxxxxxxx'),
)
data = response  # the response is a dictionary
print(data)

It printed out more than 1,000 records. But when I added the while loop below, it took so long that it never completed.

import boto3
import pandas as pd
import json
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('acloudapi_media_url_testing')
response = table.scan(
    FilterExpression=Attr('api_key').eq('xxxxxxxxxx'),
)
data = response  # the response is a dictionary
print(data)

print(response['LastEvaluatedKey'])  # present while more pages remain
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'],
                          FilterExpression=Attr('api_key').eq('xxxxxxxxxx'))
    data.update(response)  # note: this replaces the previous page's 'Items' rather than accumulating them
print(data)

Can anyone advise me on what to do? I'm quite new to programming, so my code may look quite amateurish.

The key question is: how many items are in your DynamoDB table?

I will just explain the above code, especially the while loop that you added recently.

The while loop scans the DynamoDB table until there are no more items left to scan. A single Scan call returns at most 1 MB of data, so the call has to be repeated until every item has been read.

while 'LastEvaluatedKey' in response:

If the total number of scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and results are returned to the user as a LastEvaluatedKey value to continue the scan in a subsequent operation.

Imagine: if you have millions of items in the table, the program has to page through all of them before it finishes. Also, the filter criteria are applied to the scan result set after the items are read, so you pay for reading every item in the table, not just the items that match the filter.
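
For reference, here is a minimal sketch of the usual pagination pattern, reusing the table name and the placeholder api_key value from your question. Note that each page's Items must be accumulated explicitly; dict.update(response) in your loop overwrites the previous page's Items instead of appending to them:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('acloudapi_media_url_testing')

items = []
scan_kwargs = {'FilterExpression': Attr('api_key').eq('xxxxxxxxxx')}

while True:
    response = table.scan(**scan_kwargs)
    items.extend(response.get('Items', []))      # accumulate this page's items
    last_key = response.get('LastEvaluatedKey')
    if last_key is None:                         # no more pages to read
        break
    scan_kwargs['ExclusiveStartKey'] = last_key  # resume where the last page ended

print(len(items))

This still reads the whole table; it just does so correctly and without shadowing the built-in dict.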

For these reasons, a full table scan should generally be avoided on big tables. Scanning is inefficient, and it costs you read capacity as well.

Alternate solution - use a Global Secondary Index (GSI) and the Query API.
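
For example, assuming you create a GSI with api_key as its partition key (the index name 'api_key-index' below is hypothetical), a paginated Query reads only the matching items instead of the whole table:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('acloudapi_media_url_testing')

items = []
query_kwargs = {
    'IndexName': 'api_key-index',  # hypothetical GSI name; use your own
    'KeyConditionExpression': Key('api_key').eq('xxxxxxxxxx'),
}

while True:
    response = table.query(**query_kwargs)
    items.extend(response.get('Items', []))      # accumulate this page's items
    last_key = response.get('LastEvaluatedKey')
    if last_key is None:                         # no more pages to read
        break
    query_kwargs['ExclusiveStartKey'] = last_key

print(len(items))

With a Query against the GSI, DynamoDB only reads items whose api_key equals the given value, so you are billed for the matching items rather than for the entire table.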
