
Use Kafka Streams for windowing data and processing each window at once

My goal is to group by user the messages I receive from a Kafka topic and window them, so that I can aggregate the messages received in each (5-minute) window. Then I'd like to collect all the aggregates in each window and process them at once, adding them to a report of all the messages received in that 5-minute interval.

The last point seems to be the hard part: Kafka Streams doesn't seem to provide (at least I can't find it!) anything that collects everything belonging to a window into a "finite" stream that can be processed in one place.

This is the code I implemented:

StreamsBuilder builder = new StreamsBuilder();
KStream<UserId, Message> messages = builder.stream("KAFKA_TOPIC");

// Group by the record key (the user) and cut the stream into time windows
TimeWindowedKStream<UserId, Message> windowedMessages =
        messages
                .groupByKey()
                .windowedBy(TimeWindows.of(SIZE_MS));

// Collect the messages of each user per window into a list
KTable<Windowed<UserId>, List<Message>> messagesAggregatedByWindow =
        windowedMessages
                .aggregate(
                        () -> new LinkedList<>(),
                        new MyAggregator<>(),
                        Materialized.with(new MessageKeySerde(), new MessageListSerde())
                );

// Log every update of every (user, window) aggregate
messagesAggregatedByWindow
        .toStream()
        .foreach((key, value) ->
                log.info("({}), KEY {} MESSAGE {}", value.size(), key, value.toString()));

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

The result is something like:

KEY [UserId(82770583)@1531502760000/1531502770000] Message [Message(userId=UserId(82770583),message="a"),Message(userId=UserId(82770583),message="b"),Message(userId=UserId(82770583),message="d")]
KEY [UserId(77082590)@1531502760000/1531502770000] Message [Message(userId=UserId(77082590),message="g")]
KEY [UserId(85077691)@1531502750000/1531502760000] Message [Message(userId=UserId(85077691),message="h")]
KEY [UserId(79117307)@1531502780000/1531502790000] Message [Message(userId=UserId(79117307),message="e")]
KEY [UserId(73176289)@1531502760000/1531502770000] Message [Message(userId=UserId(73176289),message="r"),Message(userId=UserId(73176289),message="q")]
KEY [UserId(92077080)@1531502760000/1531502770000] Message [Message(userId=UserId(92077080),message="w")]
KEY [UserId(78530050)@1531502760000/1531502770000] Message [Message(userId=UserId(78530050),message="t")]
KEY [UserId(64640536)@1531502760000/1531502770000] Message [Message(userId=UserId(64640536),message="y")]

For each window there are many log lines, and they are interleaved with lines from other windows.

What I'd like to have is something like:

// Hypothetical implementation
windowedMessages.streamWindows((interval, window) -> process(interval, window));

where the method process would be something like:

// Hypothetical implementation
void process(Interval interval, WindowStream<UserId, List<Message>> windowStream) {
    // Create a report for the whole window
    Report report = new Report(nameFromInterval(interval));
    // Loop over the finite iterable that represents the window content
    for (WindowStreamEntry<UserId, List<Message>> entry : windowStream) {
        report.addLine(entry.getKey(), entry.getValue());
    }
    report.close();
}

The result would be grouped like this (each report is one call to my callback, void process(...)), and the offsets for a window would be committed only once the whole window has been processed:

Report 1:
    KEY [UserId(85077691)@1531502750000/1531502760000] Message [Message(userId=UserId(85077691),message="h")]

Report 2:
    KEY [UserId(82770583)@1531502760000/1531502770000] Message [Message(userId=UserId(82770583),message="a"),Message(userId=UserId(82770583),message="b"),Message(userId=UserId(82770583),message="d")]
    KEY [UserId(77082590)@1531502760000/1531502770000] Message [Message(userId=UserId(77082590),message="g")]
    KEY [UserId(73176289)@1531502760000/1531502770000] Message [Message(userId=UserId(73176289),message="r"),Message(userId=UserId(73176289),message="q")]
    KEY [UserId(92077080)@1531502760000/1531502770000] Message [Message(userId=UserId(92077080),message="w")]
    KEY [UserId(78530050)@1531502760000/1531502770000] Message [Message(userId=UserId(78530050),message="t")]
    KEY [UserId(64640536)@1531502760000/1531502770000] Message [Message(userId=UserId(64640536),message="y")]

Report 3:
    KEY [UserId(79117307)@1531502780000/1531502790000] Message [Message(userId=UserId(79117307),message="e")]
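
The grouping shown above can be sketched in plain Java, independent of Kafka Streams. This is only an illustration of the target shape, assuming the per-user aggregates of several windows are already available in memory; the names WindowReports, ReportLine and groupByWindow are hypothetical and are not part of any Kafka API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names throughout; this is not part of the Kafka Streams API.
class WindowReports {

    // One per-user aggregate of one window (mirrors a single log line).
    static final class ReportLine {
        final long windowStart;
        final long windowEnd;
        final String userId;
        final List<String> messages;

        ReportLine(long windowStart, long windowEnd, String userId, List<String> messages) {
            this.windowStart = windowStart;
            this.windowEnd = windowEnd;
            this.userId = userId;
            this.messages = messages;
        }
    }

    // Groups lines by window start; the TreeMap keeps windows in time order.
    static Map<Long, List<ReportLine>> groupByWindow(List<ReportLine> lines) {
        Map<Long, List<ReportLine>> byWindow = new TreeMap<>();
        for (ReportLine line : lines) {
            byWindow.computeIfAbsent(line.windowStart, k -> new ArrayList<>()).add(line);
        }
        return byWindow;
    }

    public static void main(String[] args) {
        // A few of the interleaved results from the log output, in arrival order.
        List<ReportLine> mixed = List.of(
                new ReportLine(1531502760000L, 1531502770000L, "82770583", List.of("a", "b", "d")),
                new ReportLine(1531502760000L, 1531502770000L, "77082590", List.of("g")),
                new ReportLine(1531502750000L, 1531502760000L, "85077691", List.of("h")),
                new ReportLine(1531502780000L, 1531502790000L, "79117307", List.of("e")));

        int reportNumber = 1;
        for (List<ReportLine> window : groupByWindow(mixed).values()) {
            System.out.println("Report " + reportNumber++ + ":");
            for (ReportLine line : window) {
                System.out.println("    KEY [UserId(" + line.userId + ")@"
                        + line.windowStart + "/" + line.windowEnd + "] Message " + line.messages);
            }
        }
    }
}
```

The sketch only works because the input is finite. In a running stream, the open question is knowing when a window is complete, which is exactly what the log output above does not tell you.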

I had the same question. I've talked with the developers of the library, and they said that this is a really common request that is not yet implemented. It will be released soon.

You can find more information here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
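
To illustrate the idea behind KIP-328 (hold back intermediate updates and emit a window only once it has closed), here is a plain-Java simulation. It is a sketch under stated assumptions, not the Kafka Streams API: the class WindowCloseBuffer and its add method are hypothetical, windows are tumbling and aligned to the epoch, "stream time" is simply the highest timestamp seen so far, and records arriving for an already closed window are dropped (no grace period). As far as I know, the feature itself later shipped in Apache Kafka 2.1 as KTable#suppress with Suppressed.untilWindowCloses(...); note that it emits one final result per key and window, so building one report per window still needs a grouping step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of "emit a window only once it has closed".
// All names are hypothetical; this is not the Kafka Streams API.
class WindowCloseBuffer {
    private final long windowSizeMs;
    // windowStart -> (userId -> messages); the TreeMap keeps windows in time order.
    private final TreeMap<Long, Map<String, List<String>>> open = new TreeMap<>();
    // Highest record timestamp seen so far.
    private long streamTime = Long.MIN_VALUE;

    WindowCloseBuffer(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Buffers one record and returns every window closed by the new stream time,
    // keyed by window start. Each returned value is one complete window.
    Map<Long, Map<String, List<String>>> add(String userId, String message, long timestampMs) {
        long start = timestampMs - (timestampMs % windowSizeMs);
        streamTime = Math.max(streamTime, timestampMs);

        // Buffer the record only if its window is still open; late records are dropped.
        if (start + windowSizeMs > streamTime) {
            open.computeIfAbsent(start, k -> new LinkedHashMap<>())
                    .computeIfAbsent(userId, k -> new ArrayList<>())
                    .add(message);
        }

        // Flush every window whose end is at or before stream time.
        Map<Long, Map<String, List<String>>> closed = new TreeMap<>();
        while (!open.isEmpty() && open.firstKey() + windowSizeMs <= streamTime) {
            Map.Entry<Long, Map<String, List<String>>> first = open.pollFirstEntry();
            closed.put(first.getKey(), first.getValue());
        }
        return closed;
    }
}
```

Each map returned by add could be handed to a callback like the process(...) method from the question, since it holds all users of one finished window. A window is only flushed when a later record advances stream time, so the last window stays buffered until more data arrives.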
