
Ordering of streaming data with kinesis stream and firehose

Original post 2017-04-04 09:06:55 9 1 amazon-web-services/ amazon-s3/ aws-lambda/ amazon-kinesis/ amazon-kinesis-firehose

I have an architecture dilemma in my current project, which does near-real-time processing of a large amount of data. Here is a diagram of the current architecture:

Here is an explanation of my idea which led me to that picture:

When the API Gateway receives a request, it is put into the stream (my application is "fire and forget" by nature, which is how I arrived at this design). The input data is distributed across the shards based on a specific request attribute, which guarantees the correct order for records that share that attribute.
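To illustrate the ordering guarantee described above: Kinesis Data Streams hashes the partition key with MD5 and routes the record to the shard whose hash key range contains the result, so records with the same key land on the same shard and stay ordered within it. The following is only a sketch of that routing. It assumes the shards split the hash key space evenly (as in a stream that has not been resharded), and `shard_for_key` and the `user-42` key are made-up names for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard that would receive this partition key.

    Assumption: the shards divide the hash key space into equal ranges.
    After a reshard the real ranges can differ, so treat this as a model only.
    """
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = HASH_SPACE // num_shards
    # Clamp so keys in the leftover tail of an uneven split map to the last shard.
    return min(hash_key // range_size, num_shards - 1)


# Every request carrying the same attribute value maps to the same shard.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))
```

The guarantee is per partition key (and per shard), not across the whole stream, which is why the choice of the request attribute matters.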

Then I have a Lambda that takes care of validating the input and detecting anomalies. It is an abstraction that keeps the data clean for the next layer, the data enrichment. This Lambda sends the data to a Kinesis Firehose, because Firehose can back up the "raw" data (something I definitely want to have) and can also attach a transformation Lambda that does the enrichment, so I don't have to take care of saving the data in S3; that comes out of the box. Everything is great until the moment I need the ordering of the received data to be preserved (the enricher does sessionization). That ordering is lost in Firehose, because there is no data separation there as there is in Kinesis streams.
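For reference, a Firehose transformation Lambda receives a batch under `event["records"]` and must return every `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed` plus base64-encoded `data`. The sketch below shows that contract only; the `enrich` function and the `enriched` field are hypothetical placeholders, not the enrichment used in this project.

```python
import base64
import json


def enrich(payload: dict) -> dict:
    """Hypothetical enrichment step; replace with real business logic."""
    return {**payload, "enriched": True}


def handler(event, context):
    """Firehose data-transformation handler.

    Each input record must be answered with the same recordId and a
    result of Ok, Dropped or ProcessingFailed.
    """
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            data = json.dumps(enrich(payload)) + "\n"  # newline-delimited JSON in S3
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are handed back unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Note that the handler only sees one batch at a time and has no partition key to rely on, which is the limitation described in the paragraph above.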

So the only thing I could think of is to move the sessionization into the first Lambda. That would break my abstraction, because the first Lambda would start caring about data enrichment, and the bigger drawback is that the backup data would then contain enriched data, which also breaks the architecture. All of this happens because Firehose lacks a sharding concept.

Can anyone think of a solution to this problem that doesn't lose the out-of-the-box features AWS provides?

1 answer

I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.

A session is a time-bound, strictly ordered flow of events that are bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so you should label flows with session context at the source, where sessions can be bounded.
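One way to picture "labeling flows with session context at the source" is the sketch below. It assumes a session ends after a period of inactivity, and that events arrive in shard order as dicts with hypothetical `key` and `timestamp` fields. The in-memory `state` dict is for illustration only; a real Lambda would need to persist that state between invocations.

```python
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the question


def label_events(events, state=None, gap=SESSION_GAP_SECONDS):
    """Attach a session id and an in-session sequence number to each event.

    events: iterable of dicts with 'key' and 'timestamp' (seconds),
            already ordered per key, as they are within a Kinesis shard.
    state:  per-key bookkeeping carried over between calls.
    """
    state = {} if state is None else state
    labeled = []
    for event in events:
        current = state.get(event["key"])
        if current is None or event["timestamp"] - current["last_seen"] > gap:
            # First event for this key, or the gap was exceeded: open a new session.
            next_no = current["session_no"] + 1 if current else 1
            current = {"session_no": next_no, "seq": 0}
        current["seq"] += 1
        current["last_seen"] = event["timestamp"]
        state[event["key"]] = current
        labeled.append({
            **event,
            "session_id": f'{event["key"]}#{current["session_no"]}',
            "session_seq": current["seq"],
        })
    return labeled
```

Only the session label and sequence number are added here, so the raw backup stays free of business enrichment while still carrying enough context to rebuild the order later.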

If storing session information in a backup is a problem, it may be that the definition of a session is not well specified or is subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided that enough additional data has also been recorded, in enough detail, to inform whatever future definitions of a session may arise.

Additional enrichment providing business context (aka externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
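Processing "within the previously recorded boundaries" could look roughly like the sketch below. It assumes each record already carries a session label and an in-session sequence number; the field names `session_id` and `session_seq`, and the `session_length` enrichment, are hypothetical. Because the order is recorded in the data itself, it can be rebuilt even if delivery shuffled the records.

```python
from collections import defaultdict


def sessions_in_order(records):
    """Group records by session and restore the recorded order in each session."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["session_id"]].append(record)
    return {
        session_id: sorted(items, key=lambda r: r["session_seq"])
        for session_id, items in grouped.items()
    }


def enrich_sessions(records):
    """Hypothetical enrichment: annotate every event with its session length."""
    enriched = []
    for session_id, items in sessions_in_order(records).items():
        for item in items:
            enriched.append({**item, "session_length": len(items)})
    return enriched
```

Each session is handled as one unit, so the enricher never needs ordering guarantees from the transport between the two Lambdas.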

If sessions aren't transactional at the business level, then the definition of a session is over- or under-specified. If that is the case, you are out of the stream processing business and into batch processing, where you will need to scale state to the number of possible simultaneous interleaved sessions and their maximum durations, querying the entire corpus of events to bracket sessions of hopefully manageable time durations.

Regarding KInesis Firehose data stream to AWS Lambda

Kinesis Data Firehose source `Direct PUT` vs `Kinesis Data Stream`

AWS: reading Kinesis Stream data using Kinesis Firehose in a different account

Issues with streaming data to AWS Kinesis Firehose from Python

Stream Data from SQL Server into Redshift with Kinesis Firehose

How to integrate KPL (Kinesis Producer Library) to Kinesis firehose directly without going through Kinesis Data Stream

What is the difference/use case for Kinesis services of Firehose, pipeline, data stream

How does kinesis firehose stream data to self managed elasticsearch?

Kinesis Stream and Kinesis Firehose Updating Elasticsearch Indexes

Auto wire kinesis stream to kinesis firehose?


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.



solution1 0 2017-05-19 15:36:21
