
Apache Beam: can't get GroupByKey to work with a session window in Java

I have a simple problem. Let's say I'm reading a Parquet file that produces one Avro GenericRecord object per row, as below.

{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j1"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j2"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j3"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j4"}

{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p1"}
{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p2"}
{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p3"}

This file was flattened on purpose, and I would like to un-flatten it.

  • We know that the input is ordered. I would like to process records only until the next session key, then pass them to the next apply in the pipeline, to keep the memory requirement minimal. So the intermediate stage should return KV<String, Iterable<GenericRecord>>, or even better a combined KV<String, GenericRecord>:
<"john:doe:40", {"name":"john", "surename":"doe", "age":40, ["unique_attribute":"j1", ...]}>
<"paul:carl:28", {"name":"paul", "surename":"carl", "age":28, ["unique_attribute":"p1", ...]}>

This is what I've got so far:

        pipeline.apply("FilePattern", FileIO.match().filepattern(PARQUET_FILE_PATTERN))
                .apply("FileReadMatches", FileIO.readMatches())
                .apply("ParquetReadFiles", ParquetIO.readFiles(schema))
                .apply("SetKeyValuePK", WithKeys.of(input -> AvroSupport.of(input).extractString("user_pk").get()))
                .setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema)))
                .apply(Window.into(Sessions.withGapDuration(Duration.standardSeconds(5L))))
                .setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema)))
                .apply("SetGroupByPK", GroupByKey.create())
                .setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(AvroCoder.of(schema))))
...
...

I don't know if there is a better way of doing this, but for now I've used the Sessions.withGapDuration windowing strategy. I expected to get a grouped KV<String, Iterable<GenericRecord>> element roughly every 5 seconds, but I'm not getting anything after GroupByKey. I'm not even sure GroupByKey is actually doing anything, but memory is increasing rapidly, so it must be waiting for all the items.

So the question is: how would you set up a windowing function that allows GroupByKey to fire? I've also tried Combine.perKey, as it is supposed to be GroupByKey plus a combining function, but I couldn't get it implemented.
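For reference, Beam's Combine.perKey folds all values for a key through an accumulator rather than materializing the full iterable. Outside of Beam, that accumulate-per-key shape can be sketched in plain Java; the generic helper below is a hypothetical illustration, not Beam's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class CombinePerKeySketch {
    // Hypothetical per-key fold, mirroring the createAccumulator/addInput shape
    // of a Beam CombineFn: one accumulator per key, each value folded in once.
    static <K, V, A> Map<K, A> combinePerKey(List<Map.Entry<K, V>> input,
                                             Supplier<A> createAccumulator,
                                             BiFunction<A, V, A> addInput) {
        Map<K, A> accumulators = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : input) {
            A acc = accumulators.getOrDefault(e.getKey(), createAccumulator.get());
            accumulators.put(e.getKey(), addInput.apply(acc, e.getValue()));
        }
        return accumulators;
    }

    public static void main(String[] args) {
        // Collect unique_attribute values per user_pk.
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("john:doe:40", "j1"),
                Map.entry("john:doe:40", "j2"),
                Map.entry("paul:carl:28", "p1"));
        Map<String, List<String>> out = combinePerKey(
                input, ArrayList::new, (acc, v) -> { acc.add(v); return acc; });
        System.out.println(out);
    }
}
```

In a real pipeline the same logic would live in a CombineFn's createAccumulator/addInput/mergeAccumulators methods, but the windowing/triggering issue described below applies to Combine.perKey just as it does to GroupByKey.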

I've managed to get the GroupByKey working, but I'm not sure I fully understand why. I had to add two things. First, (any?) IO operation in Beam doesn't add timestamps:

.apply("WithTimestamp", WithTimestamps.of(input -> Instant.now()))

Second, I added a trigger so that GroupByKey would actually fire. No idea why it wasn't triggering in the first place; I'm sure someone has an explanation for this.

.apply("SessionWindow", Window.<KV<String, GenericRecord>>into(Sessions.withGapDuration(Duration.standardSeconds(5L))).triggering(
                        AfterWatermark.pastEndOfWindow()
                                .withLateFirings(AfterProcessingTime
                                        .pastFirstElementInPane().plusDelayOf(Duration.ZERO)))
                        .withAllowedLateness(Duration.ZERO)
                        .discardingFiredPanes())

It's not perfect. I still had to wait a couple of minutes before I saw GroupByKey get triggered, even though the window is only 5s, but it gets triggered in the end, which is progress.

EDIT: it looks like the timestamp wasn't needed after all, I assume because the window is session-based rather than time-based. I've also changed the pipeline options to streaming:

        options.as(StreamingOptions.class)
                .setStreaming(true);

I hope this helps someone who is having a similar issue.

