
Akka-streams time based grouping

I have an application which listens to a stream of events. These events tend to come in chunks: 10 to 20 of them within the same second, with minutes or even hours of silence between them. These events are processed and result in an aggregate state, and this updated state is sent further downstream.

In pseudo code, it would look something like this:

kafkaSource()
  .mapAsync(1) { case (entityId, event) =>
    entityProcessor(entityId).process(event) // yields entityState
  }
  .mapAsync(1)(entityState => submitStateToExternalService(entityState))
  .runWith(kafkaCommitterSink)

The thing is that the downstream submitStateToExternalService has no use for 10-20 updated states per second - it would be far more efficient to just emit the last one and only handle that one.

With that in mind, I started looking into whether it would be possible to hold off emitting the state immediately after processing, and instead wait a little while to see if more events are coming in.

In a way, it's similar to conflate, but that emits elements as soon as the downstream stops backpressuring, and my processing is actually fast enough to keep up with the events coming in, so I can't rely on backpressure.

I came across groupedWithin, but this emits elements whenever the window ends (or the max number of elements is reached). What I would ideally want is a time window where the waiting time before emitting downstream is reset by each new element in the group.
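In other words, only the last element of each quiet-period-separated burst should go downstream. A rough model of those semantics (a sketch, with a hypothetical helper name - not an existing akka-streams operator):

```scala
// Hypothetical model of the desired debounce semantics: given the sorted
// timestamps (ms) of incoming events and a quiet period, return the
// timestamps of the events that would be emitted downstream - the last
// event of each burst, once no further event arrives within the window.
def debounceEmissions(timestamps: List[Long], quietMs: Long): List[Long] =
  timestamps match {
    case Nil => Nil
    case _ =>
      timestamps.zip(timestamps.tail).collect {
        case (t, next) if next - t > quietMs => t
      } :+ timestamps.last
  }
```

So a burst of events at 0, 100 and 200 ms followed by silence would produce a single emission for the 200 ms element.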

Before I implement something to do this myself, I wanted to make sure that I didn't just overlook a way of doing this that is already present in akka-streams, because this seems like a fairly common thing to do.

Honestly, I would make entityProcessor into a cluster-sharded persistent actor.

case class ProcessEvent(entityId: String, evt: EntityEvent)

val entityRegion = ClusterSharding(system).shardRegion("entity")

kafkaSource()
  .mapAsync(parallelism) { case (entityId, event) =>
    entityRegion ? ProcessEvent(entityId, event) // ask requires an implicit akka.util.Timeout in scope
  }
  .runWith(kafkaCommitterSink)

With this, you can safely increase the parallelism so that you can handle events for multiple entities simultaneously without fear of mis-ordering the events for any particular entity.

Your entity actors would then update their state in response to the process commands and persist the events using a suitable persistence plugin, sending a reply to complete the ask pattern. One way to get the compaction effect you're looking for is for them to schedule the update of the external service after some period of time (after cancelling any previously scheduled update).
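The cancel-and-reschedule step can be modelled with a tiny piece of per-entity state (a sketch with hypothetical names; in a real Akka Typed actor, `Behaviors.withTimers` with `startSingleTimer` under a fixed timer key gives the same replace-on-reschedule effect, since starting a timer with an existing key cancels the previous one):

```scala
// Sketch of the reset-on-event scheduling the entity actor would do
// (hypothetical names). Each processed event replaces any pending
// publish deadline; the scheduled publish only fires if it is still due.
final case class PendingPublish(deadlineMs: Long)

// Called on every processed event: (re)schedule the external update.
def reschedule(nowMs: Long, quietMs: Long): PendingPublish =
  PendingPublish(nowMs + quietMs)

// Called when the timer fires: publish only if no later event moved the deadline.
def shouldFire(pending: PendingPublish, nowMs: Long): Boolean =
  nowMs >= pending.deadlineMs
```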

There is one potential pitfall with this scheme (it's also a potential issue with a homemade Akka Stream solution allowing n > 1 events to be processed before updating the state): what happens if the service fails between updating the local view of the state and updating the external service?

One way you can deal with this is to encode whether the entity is dirty (has state which hasn't propagated to the external service) in the entity's state, and at startup build a list of entities and run through them to have dirty entities update the external state.

If the entities are doing more than just tracking state for publishing to a single external datastore, it might be useful to use Akka Persistence Query to build a full-fledged read-side view to update the external service. In this case, though, since the read-side view's (State, Event) => State transition would be the same as the entity processor's, it might not make sense to go this way.
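That (State, Event) => State transition is just a left fold over the event log; a toy version (types hypothetical) of the fold both the entity processor and a read-side view would share:

```scala
// Toy (State, Event) => State transition: a read-side view replays the
// persisted events through the same fold the entity processor applies.
final case class State(count: Int)

sealed trait Event
case object Incremented extends Event

def applyEvent(s: State, e: Event): State = e match {
  case Incremented => State(s.count + 1)
}

def replay(events: List[Event]): State =
  events.foldLeft(State(0))(applyEvent)
```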

A midway alternative would be to offload the scheduling etc. to a different actor or set of actors which get told "this entity updated its state" and then schedule an ask of the entity for its current state, with a timestamp of when the state was locally updated. When the response is received, the external service is updated if the timestamp is newer than the last update.
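That final timestamp comparison amounts to a simple staleness guard (a sketch with hypothetical names):

```scala
// Hypothetical guard for the offloaded scheduler: only push a snapshot to
// the external service if it is newer than the last update we published
// for that entity (or if we have never published for it).
def shouldPublish(lastPublishedMs: Map[String, Long],
                  entityId: String,
                  snapshotTsMs: Long): Boolean =
  lastPublishedMs.get(entityId).forall(_ < snapshotTsMs)
```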
