Kafka Streams: total word count in a message

https://kafka.apache.org/10/documentation/streams/quickstart

I have a question about counting words within a message using Kafka Streams. Essentially, I'd like to count the total number of words, rather than the occurrences of each distinct word. So, instead of

all     1
streams 1
lead    1
to      1
kafka   1

I need

totalWordCount   5

or something similar.

I tried a variety of changes to this part of the code:

KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    .groupBy((key, value) -> value)
    .count();

such as adding .selectKey((key, value) -> "totalWordCount") in an attempt to change each key (all, streams, etc.) to totalWordCount, thinking the count would then increment under that single key. I also tried other edits along these lines to get the total word count.

I have not succeeded, and after doing some more reading, I now think I have been approaching this incorrectly. It seems that what I need is 3 topics (I've been working with only 2) and 2 producers, where the second producer somehow takes the data from the first producer (the count of each distinct word) and adds up the numbers to output the total number of words, but I'm not entirely sure how to approach it. Any help or guidance is greatly appreciated. Thanks.

Where did you put the selectKey()? The idea is basically correct, but note that groupBy() sets the key, too, so it replaces any key set by an earlier selectKey().

KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    .groupBy((key, value) -> "totalWordCount")
    .count();
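
To see why the grouping key decides the result, here is a minimal plain-Java sketch (no Kafka dependency) that imitates flatMapValues(...).groupBy(...).count() on an in-memory list. The class name GroupingKeySketch and the helper countBy are hypothetical names used only for illustration, and the filter that drops empty tokens is an addition that the snippets above do not contain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; it only imitates the grouping semantics in memory.
final class GroupingKeySketch {

    // Split each line into words, assign every word a grouping key, then count per key.
    static Map<String, Long> countBy(List<String> lines, Function<String, String> keySelector) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
            // Assumption: skip the empty token that split() yields for leading punctuation.
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(keySelector, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("all streams lead to kafka");

        // Key = the word itself: one count per distinct word.
        System.out.println(countBy(lines, word -> word));
        // prints {all=1, kafka=1, lead=1, streams=1, to=1}

        // Key = a constant: every word lands in the same group.
        System.out.println(countBy(lines, word -> "totalWordCount"));
        // prints {totalWordCount=5}
    }
}
```

In Kafka Streams itself, grouping on one constant key means every record is repartitioned under that same key, so the whole count is handled by a single task. That is usually fine for a demo, but worth keeping in mind for high-volume topics.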

or (using groupByKey() so that the key is not changed again before the aggregation)

KTable<String, Long> wordCounts = textLines
    .selectKey((key, value) -> "totalWordCount")
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    .groupByKey()
    .count();
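
One nuance worth spelling out: count() on a constant key keeps a running total across every message read from the topic, and the resulting KTable is updated as new messages arrive. If the goal is instead the number of words in each individual message, a per-record transformation such as mapValues() would probably be the more natural fit. The plain-Java sketch below (hypothetical class and method names, no Kafka dependency) contrasts the two behaviours on an in-memory list; it skips empty tokens, which is an assumption not present in the snippets above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it models the two behaviours without Kafka.
final class RunningTotalSketch {

    // Number of words in one message, using the same "\\W+" split as above.
    static long wordsIn(String message) {
        long n = 0;
        for (String token : message.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                n++;
            }
        }
        return n;
    }

    // One independent count per message (what a per-record mapping would give).
    static List<Long> perMessage(List<String> messages) {
        List<Long> out = new ArrayList<>();
        for (String m : messages) {
            out.add(wordsIn(m));
        }
        return out;
    }

    // The accumulated total after each message has been fully processed
    // (what a count under one constant key converges to).
    static List<Long> runningTotal(List<String> messages) {
        List<Long> out = new ArrayList<>();
        long total = 0;
        for (String m : messages) {
            total += wordsIn(m);
            out.add(total);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("all streams lead to kafka", "hello kafka streams");
        System.out.println(perMessage(messages));   // prints [5, 3]
        System.out.println(runningTotal(messages)); // prints [5, 8]
    }
}
```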
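
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafkaStreams;

@Configuration
@EnableKafkaStreams
public class FirstStreamApp {
    @Bean
    public KStream<String, String> process(StreamsBuilder builder) {
        KStream<String, String> inputStream = builder.stream("streamIn", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> upperCaseStream = inputStream.mapValues(value -> value.toUpperCase());
        upperCaseStream.to("outTopic", Produced.with(Serdes.String(), Serdes.String()));

        // Each word becomes its own key, so this counts occurrences per word, not one overall total.
        KTable<String, Long> wordCounts = upperCaseStream
            .flatMapValues(v -> Arrays.asList(v.split(" ")))
            .selectKey((k, v) -> v)
            .groupByKey()
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
        wordCounts.toStream().to("wordCountTopic", Produced.with(Serdes.String(), Serdes.Long()));
        return upperCaseStream;
    }
}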

 