
Ordering of streaming data with Kinesis stream and Firehose

I have an architecture dilemma in my current project, which does near-real-time processing of a large amount of data. Here is a diagram of the current architecture:

[diagram: current architecture]

Here is the reasoning that led me to that picture:

When the API Gateway receives a request, it is put into the Kinesis stream (because of the "fire and forget" nature of my application). The input data is separated into shards based on a specific request attribute, which guarantees me the correct ordering.
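A minimal sketch of that producer step, assuming Python with boto3; the stream name `ingest-stream` and the ordering attribute `order_key` are placeholders for whatever the API actually uses:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def forward_request(request: dict) -> None:
    """Put an incoming API request onto the Kinesis stream, using the
    ordering attribute as the partition key so that all events for the
    same entity land on the same shard and stay strictly ordered."""
    kinesis.put_record(
        StreamName="ingest-stream",              # hypothetical stream name
        Data=json.dumps(request).encode("utf-8"),
        PartitionKey=str(request["order_key"]),  # hypothetical ordering attribute
    )
```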

Then I have a Lambda that handles input validation and anomaly detection, so it acts as an abstraction that keeps the data clean for the next layer, the data enrichment. This Lambda sends the data to a Kinesis Firehose delivery stream, because Firehose can back up the "raw" data (something I definitely want to have) and also attach a transformation Lambda that will do the enrichment, so I don't have to take care of saving the data to S3 myself; it comes out of the box. Everything is great up to the moment where I need the received data to keep its ordering (the enricher is doing sessionization), and that ordering is lost in Firehose, because there is no data separation there the way there is in a Kinesis stream.
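For illustration, a sketch of what that first Lambda could look like, assuming Python with boto3 and a Kinesis event source; the delivery stream name `enrichment-firehose` and the `is_valid` check are hypothetical stand-ins for the real validation and anomaly-detection logic:

```python
import base64
import json
import boto3

firehose = boto3.client("firehose")

def is_valid(payload: dict) -> bool:
    # placeholder validation / anomaly check
    return "order_key" in payload

def handler(event, context):
    """Kinesis-triggered Lambda: validate each record and forward the
    clean ones to the Firehose delivery stream for backup + enrichment."""
    valid = []
    for rec in event["Records"]:
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        if is_valid(payload):
            valid.append({"Data": (json.dumps(payload) + "\n").encode("utf-8")})
    if valid:
        firehose.put_record_batch(
            DeliveryStreamName="enrichment-firehose",  # hypothetical delivery stream
            Records=valid,
        )
```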

So the only thing I could think of is to move the sessionization into the first Lambda, but that breaks my abstraction, because the first Lambda would start caring about data enrichment, and the bigger drawback is that the backup data would then contain enriched data, which also breaks the architecture. And all of this happens because Firehose has no concept of sharding.

So, can someone think of a solution to this problem that does not lose the out-of-the-box features AWS provides?

I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.

A session is a time-bounded, strictly ordered flow of events bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so flows should be labelled with session context at the source, where sessions can still be bounded.
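As a rough sketch of what "labelling at the source" could mean in that first Lambda: derive a session id from the same attribute that drives the shard partitioning plus a time window, and attach it to every event before it goes to Firehose. The field names (`order_key`, `timestamp`, `session_id`) and the 30-minute window are assumptions, not part of the question:

```python
import hashlib

SESSION_WINDOW_SECONDS = 30 * 60  # assumed 30-minute session boundary

def label_with_session(payload: dict) -> dict:
    """Attach a session id while shard ordering is still known, so
    downstream stages (and the S3 backup) can group events by session
    without having to re-derive it later."""
    window = int(payload["timestamp"]) // SESSION_WINDOW_SECONDS  # assumed epoch-seconds field
    raw = f'{payload["order_key"]}:{window}'                      # assumed ordering attribute
    payload["session_id"] = hashlib.sha1(raw.encode("utf-8")).hexdigest()
    return payload
```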

If storing session information in the backup is a problem, it may be that the definition of a session is not well specified or is subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided that enough additional data, recorded in enough detail, is available to inform whatever unpredictable future notion of a session may arise.

Additional enrichment providing business context (i.e. externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
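If that enrichment runs as the Firehose transformation Lambda mentioned in the question, it can work per record using the session label attached upstream. A sketch following the Firehose data-transformation record contract (`recordId`, `result`, base64 `data`); the `enrich` lookup is a hypothetical placeholder:

```python
import base64
import json

def enrich(session_id: str, payload: dict) -> dict:
    # placeholder: look up business context for this session
    return {"session": session_id}

def handler(event, context):
    """Firehose transformation Lambda: enrich each record, keyed by the
    session_id attached upstream, and return it in the format the
    Firehose transformation contract expects."""
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["enriched"] = enrich(payload["session_id"], payload)
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```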

If sessions aren't transactional at the business level, then the definition of a session is over- or under-specified. If that is the case, you are out of the stream-processing business and into batch processing, where you will need to scale state to the number of possibly simultaneously interleaved sessions and their maximum durations, querying the entire corpus of events to bracket sessions into hopefully manageable time windows.
