
Stateful processing with User Session Window in Apache Beam

I have a simple use case: I want to maintain a ValueState counter for the events that occur in each user session window.

The problem is that I get the exception below when I try this:

java.lang.UnsupportedOperationException: MergingWindowFn is not supported for stateful DoFns, WindowFn is: org.apache.beam.sdk.transforms.windowing.Sessions@1d4df
    at org.apache.beam.repackaged.direct_java.runners.core.StatefulDoFnRunner.rejectMergingWindowFn (StatefulDoFnRunner.java:112)
    at org.apache.beam.repackaged.direct_java.runners.core.StatefulDoFnRunner.<init> (StatefulDoFnRunner.java:107)
    at org.apache.beam.repackaged.direct_java.runners.core.DoFnRunners.defaultStatefulDoFnRunner (DoFnRunners.java:157)
    at org.apache.beam.runners.direct.ParDoEvaluator.lambda$defaultRunnerFactory$0 (ParDoEvaluator.java:111)
    at org.apache.beam.runners.direct.ParDoEvaluator.create (ParDoEvaluator.java:156)
    at org.apache.beam.runners.direct.ParDoEvaluatorFactory.createParDoEvaluator (ParDoEvaluatorFactory.java:152)
    at org.apache.beam.runners.direct.ParDoEvaluatorFactory.createEvaluator (ParDoEvaluatorFactory.java:123)
    at org.apache.beam.runners.direct.StatefulParDoEvaluatorFactory.createEvaluator (StatefulParDoEvaluatorFactory.java:109)
    at org.apache.beam.runners.direct.StatefulParDoEvaluatorFactory.forApplication (StatefulParDoEvaluatorFactory.java:89)
    at org.apache.beam.runners.direct.TransformEvaluatorRegistry.forApplication (TransformEvaluatorRegistry.java:178)
    at org.apache.beam.runners.direct.DirectTransformExecutor.run (DirectTransformExecutor.java:122)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    at java.util.concurrent.FutureTask.run (FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
    at java.lang.Thread.run (Thread.java:748)

The code snippet does the following:

  • Reads data from a file (for testing; the real scenario would be streaming)
  • Parses the JSON
  • Maps the event timestamp
  • Transforms to a key-valued PCollection <SessionId, InputEvent>
  • Applies session windows by key: sessionId
  • Increments the value state in a ParDo and logs to verify the counter state

        pipeline

                // read data from file
                .apply("ReadInputData", TextIO.read().from(options.getInputPath()))

                // parse json
                .apply("ParseJson", ParseJsons.of(InputEvents.class))
                    .setCoder(SerializableCoder.of(InputEvents.class))

                // add timestamp to events
                .apply("AddTimestamp", WithTimestamps.of(
                        (InputEvents events) -> {
                            return Instant.parse(events.getTimestamp(), DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss zzz"));
                        })
                )

                // key value pair for sessionID and events data
                .apply("MapEventsToKV", MapElements.via(
                        new SimpleFunction<InputEvents, KV<String, InputEvents>>() {
                            @Override
                            public KV<String, InputEvents> apply(InputEvents input) {
                                return KV.of(input.getSessionId(), input);
                            }
                        }))

                // window by user session
                .apply("SessionWindows", Window.<KV<String, InputEvents>>into(
                        Sessions.withGapDuration(Duration.standardMinutes(2)))
                        // withTimestampCombiner belongs to Window, not to Sessions
                        .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW)
                )

                // output log
                .apply("Log", ParDo.of(new DoFn<KV<String, InputEvents>, String>() {

                    private static final String COUNTER_NAME = "occurrences_counter";

                    @StateId(COUNTER_NAME)
                    private final StateSpec<ValueState<Integer>> counter = StateSpecs.value(VarIntCoder.of());

                    @ProcessElement
                    public void processElement(@Element KV<String, InputEvents> userSessionEvents,
                                               OutputReceiver<String> outputReceiver,
                                               @StateId(COUNTER_NAME) ValueState<Integer> counterState,
                                               IntervalWindow window) {

                        int currentValue = Optional.ofNullable(counterState.read()).orElse(0);
                        int incrementedCounter = currentValue + 1;
                        counterState.write(incrementedCounter);

                        LOG.info("Window ==> {} :: counterValue ==> {}", window.toString(), incrementedCounter);
                    }
                }));

          return pipeline.run();

Assume the input data looks like this:

session_id | event_timestamp         | attr1 | attr2 |
1          | 2021-08-29 10:54:54 UTC | x     | xx    |
1          | 2021-08-29 10:55:54 UTC | x     | xx    |
2          | 2021-08-29 10:55:59 UTC | x     | xx    |
2          | 2021-08-29 10:56:35 UTC | x     | xx    |
1          | 2021-08-29 10:56:14 UTC | x     | xx    |

The expected output is:

Window ==> 2021-08-29T10:54:54.000Z..2021-08-29T10:58:14.000Z :: counterValue ==> 3
Window ==> 2021-08-29T10:55:59.000Z..2021-08-29T10:58:35.000Z :: counterValue ==> 2
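
These windows can be derived by hand: Sessions first assigns each element its own window [timestamp, timestamp + gap) and then merges overlapping windows that share a key. Below is a minimal plain-Java sketch of that arithmetic. It is not Beam code. The class SessionMergeSketch, its Event and Session types, and the ISO-8601 timestamp strings are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these types are part of Beam.
class SessionMergeSketch {

    /** One input row: session id plus event timestamp. */
    static final class Event {
        final String sessionId;
        final Instant timestamp;

        Event(String sessionId, String isoTimestamp) {
            this.sessionId = sessionId;
            this.timestamp = Instant.parse(isoTimestamp);
        }
    }

    /** A merged half-open window [start, end) and the number of events in it. */
    static final class Session {
        final Instant start;
        Instant end;
        int count = 1;

        Session(Instant start, Instant end) {
            this.start = start;
            this.end = end;
        }
    }

    /** The five sample rows from the question, in their original order. */
    static List<Event> sampleEvents() {
        List<Event> events = new ArrayList<>();
        events.add(new Event("1", "2021-08-29T10:54:54Z"));
        events.add(new Event("1", "2021-08-29T10:55:54Z"));
        events.add(new Event("2", "2021-08-29T10:55:59Z"));
        events.add(new Event("2", "2021-08-29T10:56:35Z"));
        events.add(new Event("1", "2021-08-29T10:56:14Z"));
        return events;
    }

    /** Gives every event the window [t, t + gap) and merges overlapping windows per key. */
    static Map<String, List<Session>> sessionize(List<Event> events, Duration gap) {
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparing((Event e) -> e.timestamp));

        Map<String, List<Session>> byKey = new TreeMap<>();
        for (Event e : sorted) {
            List<Session> sessions = byKey.computeIfAbsent(e.sessionId, k -> new ArrayList<>());
            Session last = sessions.isEmpty() ? null : sessions.get(sessions.size() - 1);
            if (last != null && e.timestamp.isBefore(last.end)) {
                // The new window overlaps the open session: extend it and count the event.
                last.end = e.timestamp.plus(gap);
                last.count++;
            } else {
                sessions.add(new Session(e.timestamp, e.timestamp.plus(gap)));
            }
        }
        return byKey;
    }

    public static void main(String[] args) {
        sessionize(sampleEvents(), Duration.ofMinutes(2)).forEach((key, sessions) -> {
            for (Session s : sessions) {
                System.out.println("session " + key + ": " + s.start + ".." + s.end
                        + " :: count " + s.count);
            }
        });
    }
}
```

With the sample data and a 2-minute gap, the sketch yields one window per session: [10:54:54, 10:58:14) with 3 events for session 1 and [10:55:59, 10:58:35) with 2 events for session 2, which matches the expected output above.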

Taking a deeper look at the Beam code, I found that Sessions is a merging WindowFn and that stateful DoFns cannot maintain state across merged windows, which is why I got the exception above.

Later, I implemented the use case using GlobalWindows and State + Timer.

The timer resets the counter if no new message arrives for the same session_id within 2 minutes.

Ref: https://beam.apache.org/blog/timely-processing

.apply("GlobalWindows", Window.<KV<String, InputEvents>>into(
        new GlobalWindows()
    )
        .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW)
        .triggering(
                Repeatedly.forever(AfterProcessingTime.
                        pastFirstElementInPane().plusDelayOf(Duration.ZERO)
                )).withAllowedLateness(Duration.ZERO).discardingFiredPanes()
)

.apply("Log", ParDo.of(new DoFn<KV<String, InputEvents>, String>() {

    private static final String COUNTER_NAME = "occurrences_counter";
    private static final String GC_TIMER = "gcTimer";

    // processing-time timer used to clear the counter of an idle session
    @TimerId(GC_TIMER)
    private final TimerSpec gcTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @StateId(COUNTER_NAME)
    private final StateSpec<ValueState<Integer>> counter = StateSpecs.value(VarIntCoder.of());

    @ProcessElement
    public void processElement(@Element KV<String, InputEvents> userSessionEvents,
                               OutputReceiver<String> outputReceiver,
                               @StateId(COUNTER_NAME) ValueState<Integer> counterState,
                               @TimerId(GC_TIMER) Timer gcTimer,
                               BoundedWindow window) {

        int currentValue = Optional.ofNullable(counterState.read()).orElse(0);
        int incrementedCounter = currentValue + 1;
        counterState.write(incrementedCounter);

        // re-arm the timer: it fires 2 minutes after the latest element for this key
        gcTimer.offset(Duration.standardMinutes(2)).setRelative();

        LOG.info("Window ==> {} :: counterValue ==> {}", window.toString(), incrementedCounter);
    }

    @OnTimer(GC_TIMER)
    public void onStale(@StateId(COUNTER_NAME) ValueState<Integer> counterState) {
        // no element arrived for 2 minutes: reset the counter
        counterState.clear();
    }
}));
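
The reset-on-inactivity logic of the DoFn above can also be traced outside Beam. Below is a minimal plain-Java sketch, assuming that the arrival time passed in by the caller stands in for processing time and that setting the timer again replaces the earlier deadline for that key. The class InactivityCounterSketch and its methods are hypothetical names, not Beam APIs.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only; it mimics per-key state and a per-key timer.
class InactivityCounterSketch {

    private final Duration timeout;
    // Stands in for the per-key ValueState<Integer>.
    private final Map<String, Integer> counters = new HashMap<>();
    // Stands in for the per-key timer: the instant at which the counter is cleared.
    private final Map<String, Instant> deadlines = new HashMap<>();

    InactivityCounterSketch(Duration timeout) {
        this.timeout = timeout;
    }

    /** Fires every timer whose deadline is at or before now and clears that key's counter. */
    void advanceTo(Instant now) {
        deadlines.entrySet().removeIf(entry -> {
            boolean due = !entry.getValue().isAfter(now);
            if (due) {
                counters.remove(entry.getKey());
            }
            return due;
        });
    }

    /** Processes one element arriving at now and returns the incremented counter. */
    int onElement(String key, Instant now) {
        advanceTo(now);
        int incremented = counters.getOrDefault(key, 0) + 1;
        counters.put(key, incremented);
        // Re-arming the timer replaces the previous deadline for this key.
        deadlines.put(key, now.plus(timeout));
        return incremented;
    }

    public static void main(String[] args) {
        InactivityCounterSketch sketch = new InactivityCounterSketch(Duration.ofMinutes(2));
        String[] arrivals = {
            "2021-08-29T10:54:54Z", "2021-08-29T10:55:54Z",
            "2021-08-29T10:56:14Z", "2021-08-29T10:59:00Z"
        };
        for (String arrival : arrivals) {
            int value = sketch.onElement("1", Instant.parse(arrival));
            System.out.println(arrival + " :: counterValue ==> " + value);
        }
    }
}
```

In this sketch, three arrivals for the same key that are each less than 2 minutes apart count 1, 2, 3, and an arrival after the deadline has passed starts again at 1. Because the timer in the DoFn is a processing-time timer, a file-based test run may behave differently from what the event timestamps suggest.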

