Buffering and flushing Apache Beam streaming data

Question

I have a streaming job that, on its initial run, will have to process a large amount of data. One of the DoFns calls a remote service that supports batch requests, so when working with bounded collections I use the following approach:

  private static final class Function extends DoFn<String, Void> implements Serializable {
    private static final long serialVersionUID = 2417984990958377700L;

    private static final int LIMIT = 500;

    private transient Queue<String> buffered;

    @StartBundle
    public void startBundle(Context context) throws Exception {
      buffered = new LinkedList<>();
    }

    @ProcessElement
    public void processElement(ProcessContext context) throws Exception {
      buffered.add(context.element());

      if (buffered.size() > LIMIT) {
        flush();
      }
    }

    @FinishBundle
    public void finishBundle(Context c) throws Exception {
      // process remaining
      flush();
    }

    private void flush() {
      // build batch request
      while (!buffered.isEmpty()) {
        buffered.poll();
        // do something
      }
    }
  }

Is there a way to window the data so the same approach can be used on unbounded collections?

I've tried the following:

pipeline
    .apply("Read", Read.from(source))
    .apply(WithTimestamps.of(input -> Instant.now()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2L))))
    .apply("Process", ParDo.of(new Function()));

but startBundle and finishBundle are called for every element. Is there a way to get something like RxJava's buffering (2-minute windows or 100-element bundles):

source
    .toFlowable(BackpressureStrategy.LATEST)
    .buffer(2, TimeUnit.MINUTES, 100)

Answer 1

This is a quintessential use case for the new feature of per-key-and-window state and timers.

State is described in a Beam blog post, while for timers you'll have to rely on the Javadoc. Never mind what the Javadoc says about runners supporting them; the true status is found in Beam's capability matrix.

The pattern is very much like what you have written, but state allows it to work with windows and also across bundles, since bundles may be very small in streaming. Since state must be partitioned somehow to maintain parallelism, you'll need to add some sort of key. Currently there is no automatic sharding for this.
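One way to add "some sort of key" is to derive a small, bounded shard number from each element, so that state is split across a fixed number of partitions. The sketch below is plain Java and only illustrates the idea; the class name `ShardKeys`, the method `shardFor`, and the shard count of 10 are assumptions, not part of Beam or of the original answer.

```java
/** Hypothetical helper: maps an element to one of NUM_SHARDS stable shard keys. */
final class ShardKeys {
  // Assumption: 10 shards. More shards give more parallelism but smaller batches.
  static final int NUM_SHARDS = 10;

  private ShardKeys() {}

  /** Returns a shard in [0, NUM_SHARDS); equal elements always map to the same shard. */
  static int shardFor(String element) {
    // floorMod keeps the result non-negative even when hashCode() is negative.
    return Math.floorMod(element.hashCode(), NUM_SHARDS);
  }
}
```

In a pipeline, a function like this could be used to turn each `String` into a `KV` before the stateful DoFn, for example with Beam's `WithKeys` transform. A random shard would also work if elements do not need a stable assignment.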

private static final class Function extends DoFn<KV<Key, String>, Void> implements Serializable {
  private static final long serialVersionUID = 2417984990958377700L;

  private static final int LIMIT = 500;

  @StateId("bufferedSize")
  private final StateSpec<Object, ValueState<Integer>> bufferedSizeSpec =
      StateSpecs.value(VarIntCoder.of());

  @StateId("buffered")
  private final StateSpec<Object, BagState<String>> bufferedSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext context,
      BoundedWindow window,
      @StateId("bufferedSize") ValueState<Integer> bufferedSizeState,
      @StateId("buffered") BagState<String> bufferedState,
      @TimerId("expiry") Timer expiryTimer) {

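    // firstNonNull is Guava's MoreObjects.firstNonNull (static import);
    // read() returns null until the size has been written for this key and window.
    int size = firstNonNull(bufferedSizeState.read(), 0);
    bufferedState.add(context.element().getValue());
    size += 1;
    bufferedSizeState.write(size);
    // allowedLateness is not defined in this snippet; it is assumed to be a Duration
    // field or constant matching the allowed lateness of the windowing strategy.
    expiryTimer.set(window.maxTimestamp().plus(allowedLateness));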

    if (size > LIMIT) {
      flush(context, bufferedState, bufferedSizeState);
    }
  }

  @OnTimer("expiry")
  public void onExpiry(
      OnTimerContext context,
      @StateId("bufferedSize") ValueState<Integer> bufferedSizeState,
      @StateId("buffered") BagState<String> bufferedState) {
    flush(context, bufferedState, bufferedSizeState);
  }

  private void flush(
      WindowedContext context,
      BagState<String> bufferedState,
      ValueState<Integer> bufferedSizeState) {
    Iterable<String> buffered = bufferedState.read();

    // build batch request from buffered
    ...

    // clear things
    bufferedState.clear();
    bufferedSizeState.clear();
  }
}

A few notes:

State replaces your DoFn's instance variables, since instance variables have no cohesion across windows.
The buffer and the size are initialized as needed rather than in @StartBundle.
The BagState supports "blind" writes, so there doesn't need to be any read-modify-write, just committing the new elements in the same way as when you output.
Setting a timer repeatedly for the same time is just fine; it should mostly be a no-op.
@OnTimer("expiry") takes the place of @FinishBundle, since finishing a bundle is not a per-window thing but an artifact of how a runner executes your pipeline.
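To make the control flow in these notes concrete, here is a toy in-memory model of it in plain Java. It is only a sketch: it does not use Beam, it is not fault tolerant, and `KeyedBatcher`, `process`, `onExpiry`, and `flushedBatches` are hypothetical names. One deliberate simplification is that `onExpiry` flushes every key at once, whereas a Beam timer fires per key and window.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-key buffering with a size limit and an expiry flush. */
final class KeyedBatcher {
  private final int limit;
  // Stands in for the per-key BagState; the list size stands in for the size ValueState.
  private final Map<String, List<String>> buffers = new HashMap<>();
  // Records each flushed batch instead of calling a remote service.
  private final List<List<String>> flushed = new ArrayList<>();

  KeyedBatcher(int limit) {
    this.limit = limit;
  }

  /** Mirrors @ProcessElement: buffer lazily, then flush once the size exceeds the limit. */
  void process(String key, String value) {
    List<String> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
    buffer.add(value);
    if (buffer.size() > limit) {
      flush(key);
    }
  }

  /** Mirrors @OnTimer("expiry"): flush whatever is still buffered. */
  void onExpiry() {
    // Copy the key set because flush() removes entries from the map.
    for (String key : new ArrayList<>(buffers.keySet())) {
      flush(key);
    }
  }

  private void flush(String key) {
    List<String> batch = buffers.remove(key);
    if (batch != null && !batch.isEmpty()) {
      flushed.add(batch);
    }
  }

  List<List<String>> flushedBatches() {
    return flushed;
  }
}
```

With a limit of 2, three calls to `process` for the same key produce one batch of three elements, matching the `size > LIMIT` check in the answer's code. A later `onExpiry` call emits any partially filled buffers.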

All that said, if you are writing to an external system, you may want to reify the windows and re-window into the global window before doing the writes, so that the manner of each write can depend on the window, since "the external world is globally windowed".

Answer 2

The documentation for Apache Beam 0.6.0 says that StateId is "not currently supported by any runner".

Answer 1: score 9, accepted, 2017-03-21 02:16:48

Answer 2: score 0, 2017-03-22 11:48:31
