Apache Beam :: can't get GroupByKey to work with a session window in Java
I have a simple problem. Let's say I'm reading a parquet file which produces an Avro GenericRecord object per row, as below.
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j1"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j2"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j3"}
{"name":"john", "surename":"doe", "age":40, "user_pk":"john:doe:40", "unique_attribute":"j4"}
{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p1"}
{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p2"}
{"name":"paul", "surename":"carl", "age":28, "user_pk":"paul:carl:28", "unique_attribute":"p3"}
This file was flattened on purpose and I would like to un-flatten it. We know that the input is ordered, and I would like to process the rows until the next session key and pass the result to the next apply in the pipeline, to keep memory requirements minimal. So the intermediate stage should return KV<String, Iterable<GenericRecord>> or, even better, a combined KV<String, GenericRecord>:
<"john:doe:40", {"name":"john", "surename":"doe", "age":40, ["unique_attribute":"j1", ...]}>
<"paul:carl:28", {"name":"paul", "surename":"carl", "age":28, ["unique_attribute":"p1", ...]}>
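To make the intended result concrete, here is a minimal plain-Java sketch of the "process rows until the key changes" idea. It is not Beam code: `OrderedUnflatten`, `groupConsecutive` and `row` are hypothetical names, and a `Map<String, String>` stands in for a GenericRecord. It also assumes a single sequential reader, which a Beam PCollection does not guarantee, since PCollections are unordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Apache Beam: groups an ordered sequence of
// rows by key and emits each group as soon as the key changes, so only one
// group is buffered at a time.
class OrderedUnflatten {

    // Each row is a Map<String, String> standing in for a GenericRecord.
    static void groupConsecutive(Iterable<Map<String, String>> rows,
                                 String keyField,
                                 BiConsumer<String, List<Map<String, String>>> emit) {
        String currentKey = null;
        List<Map<String, String>> buffer = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String key = row.get(keyField);
            if (currentKey != null && !currentKey.equals(key)) {
                // Key changed: the previous group is complete, hand it downstream.
                emit.accept(currentKey, buffer);
                buffer = new ArrayList<>();
            }
            currentKey = key;
            buffer.add(row);
        }
        if (currentKey != null) {
            // Flush the last group once the input is exhausted.
            emit.accept(currentKey, buffer);
        }
    }

    // Builds a reduced row with only the two fields this sketch needs.
    static Map<String, String> row(String userPk, String uniqueAttribute) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("user_pk", userPk);
        r.put("unique_attribute", uniqueAttribute);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
            row("john:doe:40", "j1"), row("john:doe:40", "j2"),
            row("paul:carl:28", "p1"));
        groupConsecutive(rows, "user_pk",
            (key, group) -> System.out.println(key + " -> " + group.size() + " rows"));
    }
}
```

The point of the sketch is the memory profile: a group leaves the buffer as soon as the next key appears, which is the behaviour the question is trying to obtain from the pipeline.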
This is what I've got so far:
pipeline.apply("FilePattern", FileIO.match().filepattern(PARQUET_FILE_PATTERN))
.apply("FileReadMatches", FileIO.readMatches())
.apply("ParquetReadFiles", ParquetIO.readFiles(schema))
.apply("SetKeyValuePK", WithKeys.of(input -> AvroSupport.of(input).extractString("user_pk").get())).setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema)))
.apply(Window.into(Sessions.withGapDuration(Duration.standardSeconds(5L)))).setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema)))
.apply("SetGroupByPK", GroupByKey.create()).setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(AvroCoder.of(schema))))
...
...
I don't know if there is a better way of doing it, but for now I've used the Sessions.withGapDuration windowing strategy. I expected to get a grouped KV<String, Iterable<GenericRecord>> element roughly every 5 seconds, but I'm not getting anything after GroupByKey. I'm not even sure GroupByKey is actually doing anything, but I know that memory is increasing rapidly, so it must be waiting for all the items.
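As a rough illustration of what a session window with a 5 second gap does per key, here is a plain-Java model of the merging rule: each element opens a window [t, t + gap), and windows that overlap are merged. `SessionSketch` and `sessions` are hypothetical names and this is not Beam's implementation. Note that sessions are driven by element timestamps, not by the key changing, so elements of one key that all carry the same timestamp end up in a single session. As far as I understand, with the default trigger the grouped result is only emitted once the watermark passes the end of the session, and for a bounded read in batch mode that typically happens only after the whole input has been read, which would be consistent with the memory growth described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; this is not how Beam is implemented internally.
class SessionSketch {

    // Returns the sessions for one key as {start, end} pairs in milliseconds.
    // Each timestamp t opens a proto-window [t, t + gap); windows that
    // overlap are merged into one session.
    static List<long[]> sessions(long[] timestampsMillis, long gapMillis) {
        long[] sorted = timestampsMillis.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // Element falls inside the current session: extend its end.
                last[1] = Math.max(last[1], t + gapMillis);
            } else {
                // Gap exceeded: start a new session.
                result.add(new long[] {t, t + gapMillis});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {0, 2_000, 4_000, 20_000}, 5_000)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

With the timestamps in `main`, the first three elements merge into one session and the fourth starts a new one, because it arrives more than 5 seconds after the previous session would have closed.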
So the question is: how would you set up a windowing function that will allow me to group by key? I've also tried Combine.byKey, as it is supposed to be GroupByKey + a windowing function, but I couldn't implement it.
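For the combined KV<String, GenericRecord> form, the logic usually boils down to an accumulator with four steps: create, add input, merge, and extract output. Beam's Combine.CombineFn exposes methods with these names (createAccumulator, addInput, mergeAccumulators, extractOutput); the sketch below only mirrors that shape in plain Java and does not extend any Beam class. `UserCombineSketch` and `Accum` are hypothetical, and the fields are taken from the sample rows above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator mirroring the four steps of a combine function.
// Plain Java only; wiring it into Combine.perKey is left out on purpose.
class UserCombineSketch {

    static final class Accum {
        String name;
        String surename; // spelled as in the source data
        int age;
        final List<String> uniqueAttributes = new ArrayList<>();
    }

    static Accum createAccumulator() {
        return new Accum();
    }

    // Folds one flattened row into the accumulator.
    static Accum addInput(Accum acc, String name, String surename, int age,
                          String uniqueAttribute) {
        acc.name = name;
        acc.surename = surename;
        acc.age = age;
        acc.uniqueAttributes.add(uniqueAttribute);
        return acc;
    }

    // Merges partial accumulators that were built independently for one key.
    static Accum mergeAccumulators(List<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum a : accums) {
            if (a.name != null) {
                merged.name = a.name;
                merged.surename = a.surename;
                merged.age = a.age;
            }
            merged.uniqueAttributes.addAll(a.uniqueAttributes);
        }
        return merged;
    }

    // Produces the final un-flattened value for the key.
    static String extractOutput(Accum acc) {
        return acc.name + " " + acc.surename + " (" + acc.age + ") " + acc.uniqueAttributes;
    }
}
```

In a real pipeline the output would be a GenericRecord rather than a String, and the accumulator would need a coder; both are omitted here to keep the sketch self-contained.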
I've managed to get the group-by working, but I'm not sure I fully understand why. I had to add two things. First, (any?) IO operations in Beam don't add timestamps, so I added one:
.apply("WithTimestamp", WithTimestamps.of(input -> Instant.now()))
Second, I've added a Trigger so that GroupByKey would actually get triggered. No idea why it wasn't triggering in the first place. I'm sure someone has an explanation for this.
.apply("SessionWindow", Window.<KV<String, GenericRecord>>into(Sessions.withGapDuration(Duration.standardSeconds(5L))).triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime
.pastFirstElementInPane().plusDelayOf(Duration.ZERO)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes())
It is not perfect: I still had to wait a couple of minutes before seeing GroupByKey get triggered, even though the window is only 5s. But it gets triggered in the end, which is progress.
EDIT: OK, it looks like the timestamp wasn't needed, I'm assuming because the window is session based and not time based. I've also changed the pipeline to streaming mode:
options.as(StreamingOptions.class)
.setStreaming(true);
I hope this helps someone who is having similar issues.