
Configuring AWS Lambda for parallel computation on DynamoDB Streams

I have a Flask app on EC2 and a Python 3.6 AWS Lambda architecture. When a request comes in to Flask, a new item is added to DynamoDB, which triggers a Lambda that starts some processing with the newly added item. For some strange reason the triggers are not processed in parallel, with a new Lambda invocation for each trigger, but one by one.

I tried setting the concurrency limit to the maximum value, but that didn't help.

I need the results as fast as possible and don't want to manage any of the scaling myself, so the triggers need to be processed in parallel, not one by one as they are now.

  1. If you develop a Lambda function with Python, parallelism doesn't come by default. Lambda supports Python 2.7 and Python 3.6, both of which have the multiprocessing and threading modules.
  2. On the other hand, you can use multiprocessing.Pipe instead of multiprocessing.Queue to accomplish what you need without getting any errors during the execution of the Lambda function (a sketch of this appears after the code below).

Please refer to the link below for more details and the source code for parallel execution:

https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/

Also, you can refer to the code below:

import time
import multiprocessing

# Map each region to its DynamoDB endpoint.
region_maps = {
    "eu-west-1": {
        "dynamodb": "dynamodb.eu-west-1.amazonaws.com"
    },
    "us-east-1": {
        "dynamodb": "dynamodb.us-east-1.amazonaws.com"
    },
    "us-east-2": {
        "dynamodb": "dynamodb.us-east-2.amazonaws.com"
    }
}

def multiprocessing_func(region):
    # Simulate a unit of work, then look up the endpoint for this region.
    time.sleep(1)
    endpoint = region_maps[region]['dynamodb']
    print('endpoint for {} is {}'.format(region, endpoint))

def lambda_handler(event, context):
    starttime = time.time()
    processes = []
    regions = ['us-east-1', 'us-east-2', 'eu-west-1']
    # Start one child process per region so the lookups run in parallel.
    for region in regions:
        p = multiprocessing.Process(target=multiprocessing_func, args=(region,))
        processes.append(p)
        p.start()

    # Wait for all child processes to finish.
    for process in processes:
        process.join()

    output = 'That took {} seconds'.format(time.time() - starttime)
    print(output)
    return output
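The example above only prints from each process. If you need to collect results back in the handler, item 2's suggestion applies: multiprocessing.Pipe works inside Lambda, while multiprocessing.Queue fails because the Lambda execution environment lacks /dev/shm. A minimal sketch of that pattern (the lookup table is repeated so the sketch is self-contained):

import time
import multiprocessing

# Same lookup table as above, abbreviated so this sketch is self-contained.
region_maps = {
    "eu-west-1": {"dynamodb": "dynamodb.eu-west-1.amazonaws.com"},
    "us-east-1": {"dynamodb": "dynamodb.us-east-1.amazonaws.com"},
    "us-east-2": {"dynamodb": "dynamodb.us-east-2.amazonaws.com"},
}

def lookup_endpoint(region, conn):
    # Child process: do the work, then send the result back through the pipe.
    time.sleep(1)
    conn.send((region, region_maps[region]['dynamodb']))
    conn.close()

def lambda_handler(event, context):
    processes = []
    parent_connections = []
    for region in ['us-east-1', 'us-east-2', 'eu-west-1']:
        parent_conn, child_conn = multiprocessing.Pipe()
        parent_connections.append(parent_conn)
        p = multiprocessing.Process(target=lookup_endpoint, args=(region, child_conn))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

    # Each pipe holds exactly one small message, so receiving after join is safe.
    return [conn.recv() for conn in parent_connections]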

Hope this helps.

The number of parallel Lambda invocations is controlled by the number of shards you are writing to in DynamoDB.

For stream-based event sources such as Amazon DynamoDB streams, AWS Lambda polls your stream and invokes your Lambda function. When your Lambda function is throttled, Lambda attempts to process the throttled batch of records until the time the data expires. This time period can be up to seven days for Amazon Kinesis. The throttled request is treated as blocking per shard, and Lambda doesn't read any new records from the shard until the throttled batch of records either expires or succeeds. If there is more than one shard in the stream, Lambda continues invoking on the non-throttled shards until one gets through.

source

This is done to ensure that events are processed in the order in which they occurred in DynamoDB. But the number of shards is not directly controlled by you.
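If you want to check how many shards your table's stream currently has, here is a minimal sketch using boto3 (the stream ARN is a placeholder; use your table's LatestStreamArn):

import boto3

streams = boto3.client('dynamodbstreams')

def shard_count(stream_arn):
    # Returns the number of shards in the first page of the stream description;
    # very large streams may need pagination via ExclusiveStartShardId.
    description = streams.describe_stream(StreamArn=stream_arn)
    return len(description['StreamDescription']['Shards'])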

Now, the best thing you can do is:

  1. Set a higher batch size in the Lambda trigger. By doing this you will receive multiple events in the same Lambda invocation, and you can use parallelism inside the function to process all of them together. But this has obvious drawbacks, like what happens if you can't process all of them before the Lambda times out, and you will have to make sure the code is thread safe; a sketch follows this list.
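A minimal sketch of that idea, assuming each stream record can be handled independently; process_record is a hypothetical placeholder for your per-item logic:

from concurrent.futures import ThreadPoolExecutor

def process_record(record):
    # Hypothetical placeholder: handle one DynamoDB stream record.
    new_image = record['dynamodb'].get('NewImage', {})
    print('processing item: {}'.format(new_image))

def lambda_handler(event, context):
    # Fan the batch out to a thread pool so records are handled concurrently.
    # Only safe if process_record is thread safe and records are independent.
    records = event['Records']
    with ThreadPoolExecutor(max_workers=max(1, len(records))) as executor:
        list(executor.map(process_record, records))
    return 'processed {} records'.format(len(records))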

Probably writing to DynamoDB is blocking parallelism in this case.

An alternative architecture for fast and very scalable processing of items: add the items to an S3 bucket as files. A trigger on the S3 bucket then starts a Lambda for each new file, so only your Lambda concurrency limit caps how many Lambdas run in parallel.
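A minimal sketch of the S3-triggered side, assuming each item is uploaded as a JSON file; process_item is a hypothetical placeholder for the actual work:

import json
import boto3

s3 = boto3.client('s3')

def process_item(item):
    # Hypothetical placeholder: do the real processing for one item.
    print('processing item: {}'.format(item))

def lambda_handler(event, context):
    # Each uploaded file triggers its own invocation, so items are processed
    # in parallel up to the account's Lambda concurrency limit.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        process_item(json.loads(body))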
