
How to trigger a Lambda for each item in a huge static DynamoDB table

I have a DynamoDB table with nearly 200k items. I need to trigger a Lambda for each item in it (send each item to the Lambda as input). I want to perform this every x hours for all the items in the table. Data in the table changes every 5 days or so.

Is there a serverless way to automate fetching all the items into a Lambda, via SQS, etc.?

I cannot have a single Lambda scan the entire table, since that is too much for one Lambda to handle (given the 300-second limit, etc.).

Thanks, Vinod.

Neither scanning the table nor changing all the data in DynamoDB is feasible.

You can keep all the DynamoDB keys in a cache like Redis. A separate job can take the keys from Redis and put them on SQS, where the Lambda is listening. The Redis keys can be kept up to date using DynamoDB Streams.
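A minimal sketch of that feeder job in Python, assuming the keys live in a Redis set named table-keys and the queue URL comes from a QUEUE_URL environment variable (both names are hypothetical):

```python
# Hypothetical "Redis -> SQS feeder" job: reads item keys from a Redis
# set and enqueues one SQS message per key for the Lambda to consume.
import json
import os

import boto3
import redis

QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical queue URL variable

r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"))
sqs = boto3.client("sqs")

def feed_keys_to_sqs():
    keys = [k.decode() for k in r.smembers("table-keys")]
    # send_message_batch accepts at most 10 messages per request.
    for i in range(0, len(keys), 10):
        batch = keys[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(j), "MessageBody": json.dumps({"pk": key})}
                for j, key in enumerate(batch)
            ],
        )

if __name__ == "__main__":
    feed_keys_to_sqs()
```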

Dynamo doesn't offer a good way to trigger a Lambda for items that already exist; however, there are a few ways you could approach this problem:

Option 1 (Scan in Lambda in small batches):

You mentioned that you were concerned with the Lambda not having enough resources to scan all of the items in the table. You could try operating on the data in smaller chunks to avoid hitting resource limitations. Lambdas have a max execution time of 15 minutes, which should be enough for most jobs. (Please note that in Lambda, CPU scales with memory, so depending on the job, over-provisioning memory could actually save you money by reducing the time the function takes to complete.)
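A minimal sketch of such a chunked scan, assuming a hypothetical table my-table and a process_item() helper you would supply yourself:

```python
# Scan the table one small page at a time so no single request is large;
# LastEvaluatedKey drives the pagination from page to page.
import boto3

dynamodb = boto3.client("dynamodb")

def process_item(item):
    ...  # your per-item work goes here

def handler(event, context):
    start_key = None
    while True:
        kwargs = {"TableName": "my-table", "Limit": 100}  # small pages
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)
        for item in page["Items"]:
            process_item(item)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break  # no more pages
```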

Option 2 (Scheduled ECS Fargate task):

In ECS, using Fargate, you can serverlessly run tasks on a cron schedule. If you are worried about resource limits, you can provision up to 4 vCPUs and 32 GB of memory per task, which makes it far less likely that you will hit a resource limit. Here is some documentation on how to set that up.
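One way to wire up such a schedule, sketched with boto3 and an EventBridge rule; every ARN, subnet ID, and name below is a hypothetical placeholder for your own resources:

```python
# Create a cron-style rule and point it at a Fargate task definition.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="scan-table-every-6-hours",
    ScheduleExpression="rate(6 hours)",
)

events.put_targets(
    Rule="scan-table-every-6-hours",
    Targets=[
        {
            "Id": "scan-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
            # Role that is allowed to call ecs:RunTask on your behalf.
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": (
                    "arn:aws:ecs:us-east-1:123456789012:"
                    "task-definition/scan-table:1"
                ),
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)
```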

Option 3 (Process items using Dynamo triggers):

You can configure your Dynamo table to trigger a Lambda whenever data in the table is inserted, modified, or removed; you can then process items as they come in or as they change. You can even configure it to batch changes of up to 10 items to reduce Lambda invocations. Here is a link to the documentation.

Note: This method doesn't trigger for items already in the table. However, you can get around this by writing a script to update an arbitrary field on those items.
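A minimal sketch of that backfill script, assuming a hypothetical table my-table with a single key attribute pk; it sets a throwaway touched_at field so the stream emits a MODIFY record for every existing item:

```python
# Touch every existing item so DynamoDB Streams sees each one once.
import time

import boto3

dynamodb = boto3.client("dynamodb")

def touch_all_items(table_name="my-table"):
    paginator = dynamodb.get_paginator("scan")
    # Project only the key attribute to keep the scan cheap.
    for page in paginator.paginate(
        TableName=table_name, ProjectionExpression="pk"
    ):
        for item in page["Items"]:
            dynamodb.update_item(
                TableName=table_name,
                Key={"pk": item["pk"]},
                UpdateExpression="SET touched_at = :now",
                ExpressionAttributeValues={
                    ":now": {"N": str(int(time.time()))}
                },
            )

if __name__ == "__main__":
    touch_all_items()
```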

When you say you want to "trigger" for each item, it's not 100% clear what you mean. In general I think DynamoDB Streams is the tool for that, but something has to cause the records to be processed by the stream. That is often done with a simple UpdateItem on each record, setting a field that likely isn't part of your data to something like the current time, or something else unique. From there, each record will be processed through a Lambda triggered on the stream.

The 100% serverless way to loop through the data is the following (a sketch of the worker Lambda follows the list):

  • Step function that calls a lambda to run through a batch of records.
  • Lambda function scans the records, looping through the pages of records.
    • Function should accept a payload that optionally supplies the paging information.
    • Function should check the Lambda context to see how much time is remaining before starting each loop, and exit if there is not enough time to process the batch.
    • Function should return the paging information (last evaluated key).
  • Step function looks to see if paging is complete.
    • If not, lambda is called again with paging information passed in.
    • If complete, end.
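A minimal sketch of that worker Lambda, assuming a hypothetical table my-table and a process_item() helper; the state machine's Choice state loops back into this function while done is false:

```python
# Scans pages of records, stops before the invocation times out, and
# returns the paging state so the Step Function can resume the scan.
import boto3

dynamodb = boto3.client("dynamodb")

# Stop scanning when fewer than 30 seconds of execution time remain.
SAFETY_MARGIN_MS = 30_000

def process_item(item):
    ...  # your per-item work goes here

def handler(event, context):
    start_key = event.get("lastEvaluatedKey")  # supplied on re-entry
    while context.get_remaining_time_in_millis() > SAFETY_MARGIN_MS:
        kwargs = {"TableName": "my-table", "Limit": 100}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)
        for item in page["Items"]:
            process_item(item)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return {"done": True}  # paging complete; state machine ends
    # Ran low on time: hand the cursor back for the next invocation.
    return {"done": False, "lastEvaluatedKey": start_key}
```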

I would explore SQS. Have a Lambda fetch up to 25 records (the max) in a batch, do what it needs, and mark the records (such as by updating a timestamp on them; use that timestamp as a filter to ensure that your fetches only ever fetch records that still need updating). You can keep fetching records. Eventually the Lambda will time out, but since it did not finish, you will not have had a chance to mark your SQS job as complete by deleting the message. SQS messages have a visibility period which, when it ends, causes them to reappear in the queue, thereby causing a Lambda to run another batch, until eventually the Lambda finds no more records and can then remove the SQS message. We use this to refresh Elasticsearch indices with all records when our index mapping changes.
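A minimal sketch of that pattern, assuming the Lambda is attached to the queue with an event source mapping (a clean return lets Lambda delete the message; an exception leaves it to reappear after the visibility timeout). The table name my-table, the updated_at attribute, and process_item() are hypothetical, and this sketch raises deliberately when time runs low instead of letting the function be killed mid-batch:

```python
# Each invocation works through stale records until the table is clean
# or the invocation is nearly out of time; an unfinished run ends with
# an exception so the SQS message survives and re-triggers the Lambda.
import time

import boto3

dynamodb = boto3.client("dynamodb")

class MoreWorkRemaining(Exception):
    """Raised so the SQS message is not deleted and re-triggers us."""

def process_item(item):
    ...  # your per-item work; it should also refresh updated_at

def handler(event, context):
    # Only records whose timestamp shows they still need updating.
    cutoff = str(int(time.time()) - 6 * 3600)  # hypothetical: 6h stale
    start_key = None
    while True:
        kwargs = {
            "TableName": "my-table",
            "FilterExpression": "updated_at < :cutoff",
            "ExpressionAttributeValues": {":cutoff": {"N": cutoff}},
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)
        for item in page["Items"]:
            process_item(item)
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return  # table fully processed; message gets deleted
        if context.get_remaining_time_in_millis() < 10_000:
            # Stop early; the visibility timeout will bring the
            # message back and another batch will run.
            raise MoreWorkRemaining
```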
