
Processing streaming data with multiple AWS Lambda functions

I have streaming data, and I use at least two Lambda functions to process each incoming record. When a Lambda function processes a record, it validates several conditions, generates a new row, and saves it to my database. This new row has a column called 'occurrence' that counts how many times that row has occurred in my database. I have to use more than one Lambda function because the volume of incoming data is huge and I want to speed up processing.

But this causes a problem. Say I have two Lambda functions, A and B, and two records, x and y, arrive at the same time: A processes x and B processes y. After processing, A and B generate the new rows x' and y' respectively. It is possible that x' is identical to y'. When A and B save x' and y' to my database, the 'occurrence' column of x' (which is also y') increases by only 1, because A and B each first look up the existing row for x' (and y') in my database and then update the 'occurrence' column, but the increase here should be 2. This problem would not happen if I used only one Lambda function.

I think the solution lies in how I store and update the data, or in how I manage two Lambda functions running at once, but so far I could not figure it out. Any help will be appreciated.

You are right: you need to be careful when storing data to ensure there are no duplicates, and when updating, to ensure you are not conflicting with another update operation. This is usually achieved by leveraging a feature of the database, so how to do it depends on which database you use.

Example 1 - DynamoDB

In DynamoDB, this can be handled using conditional updates. See this example. The idea is to specify the current value during updates, so that the update will be successful only if the current value has not changed.

So in your application, first retrieve the current value. For example, say the DB tells you the current value is 5. Then you tell DynamoDB:

update to: 6
if the current value is: 5

If another update has occurred concurrently, the current value will be 6. So when DynamoDB executes your update statement above, the condition check will fail. Then you just have to try again (retrieve the latest value and try updating again).
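For illustration, here is a minimal sketch of that read-then-conditionally-update loop in Python with boto3, as it might run inside each Lambda function. The table name table1, the key attribute ID, and the counter attribute occurrence are assumptions borrowed from the question; adjust them to your actual schema.

import boto3
from botocore.exceptions import ClientError

# Assumed table and attribute names; adjust to your schema.
table = boto3.resource("dynamodb").Table("table1")

def increment_occurrence(item_id):
    while True:
        # 1. Read the current counter value.
        #    (Assumes the row already exists, as in the question's flow.)
        current = table.get_item(Key={"ID": item_id})["Item"]["occurrence"]
        try:
            # 2. Write current + 1, but only if nobody changed it in the meantime.
            table.update_item(
                Key={"ID": item_id},
                UpdateExpression="SET #occ = :new",
                ConditionExpression="#occ = :cur",
                ExpressionAttributeNames={"#occ": "occurrence"},
                ExpressionAttributeValues={":new": current + 1, ":cur": current},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Another Lambda updated the item first; loop and retry with the fresh value.

If functions A and B race on the same item, one of them fails the condition check, re-reads the value, and retries, so the counter ends up increased by 2 as expected.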

Example 2 - MySQL

In a relational database, an UPDATE statement automatically takes a lock on the row it modifies. This means you don't really need to do anything, as long as you don't hard-code the current value in the update statement.

Meaning, doing this is fine:

UPDATE table1 SET occurrence = occurrence + 1 where ID = ${id};

The database will ensure that this statement is executed sequentially, even if 2 or more of them are submitted concurrently.

But, don't do this:

UPDATE table1 SET occurrence = ${current_value} + 1 where ID = ${id};

because this is hard-coding the value.
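As a rough sketch of how the correct statement could be issued from a Lambda function, assuming a pymysql connection and the table and column names used above:

import pymysql  # any DB-API compatible MySQL driver works the same way

def increment_occurrence(connection, row_id):
    # The increment happens inside the database itself, so concurrent
    # Lambda invocations are serialized by the row lock the UPDATE takes.
    with connection.cursor() as cursor:
        cursor.execute(
            "UPDATE table1 SET occurrence = occurrence + 1 WHERE ID = %s",
            (row_id,),
        )
    connection.commit()

Because the increment is computed inside the UPDATE, two concurrent invocations both succeed and the counter goes up by 2.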
