
Buffer messages in stream data for a given messageId

Use case: I have messages carrying a messageId, and multiple messages can share the same messageId. The messages flow through a streaming pipeline (like Kafka) partitioned by messageId, so I am making sure that all messages with the same messageId go to the same partition.
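For reference, a minimal sketch of the producer side being described; the topic name, broker address, and String serializers are placeholder assumptions. Using the messageId as the record key is what makes Kafka's default partitioner route all messages with the same id to the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// The default partitioner hashes the record key, so every record that
// shares a messageId lands in the same partition
producer.send(new ProducerRecord<>("messages", messageId, payload));
producer.close();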

So I need to write a job that buffers messages for some time (say 1 minute) and, after that time, combines all messages with the same messageId into a single large message.

I am thinking this can be done with Spark Datasets and Spark SQL (or something else?), but I could not find any example or documentation on how to hold on to messages for a given messageId for some time and then aggregate them.

I think what you're looking for is Spark Streaming. Spark has a Kafka connector that can plug into a Spark Streaming context.

Here's a really basic example that creates an RDD of all messages in a given set of topics over a 1-minute interval and then groups them by a messageId field (your value deserializer would have to expose such a getMessageId method, of course).

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

SparkConf conf = new SparkConf().setAppName(appName);
// 1-minute batches: each RDD holds one minute's worth of messages
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));

Map<String, Object> params = new HashMap<>();
params.put("bootstrap.servers", kafkaServers);
params.put("group.id", consumerGroupId);          // required by the Kafka consumer
params.put("key.deserializer", kafkaKeyDeserializer);
params.put("value.deserializer", kafkaValueDeserializer);

List<String> topics = Arrays.asList(/* add topics here */);

// Message stands for whatever type your value deserializer produces;
// it must expose a getMessageId() method
JavaInputDStream<ConsumerRecord<String, Message>> stream =
    KafkaUtils.createDirectStream(ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Message>Subscribe(topics, params));

// Group each one-minute batch by messageId (grouping alone is lazy;
// an action on the grouped RDD is needed to trigger processing)
stream.foreachRDD(rdd ->
    rdd.groupBy(record -> record.value().getMessageId())
       .foreach(group -> { /* combine group._2() into one large message */ }));

ssc.start();
ssc.awaitTermination();
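The groupBy above only defines the groups; to actually produce the single large message per messageId, one option (a sketch, not the only way) is mapToPair plus reduceByKey, assuming a hypothetical getBody() accessor on the same value type:

import scala.Tuple2;

stream.foreachRDD(rdd ->
    rdd.mapToPair(record ->
            new Tuple2<>(record.value().getMessageId(), record.value().getBody()))
       .reduceByKey((a, b) -> a + b)   // concatenate payloads per messageId
       .foreach(pair -> {
           // pair._1() = messageId, pair._2() = the combined large message;
           // write it to the job's output here
       }));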

There are several other ways to group the messages within the streaming API; look at the documentation for more examples.
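Since the question also mentions Datasets and Spark SQL: the same buffer-and-combine can be sketched with Structured Streaming's windowed aggregation instead of DStreams. This assumes the messageId travels as the Kafka record key and the payload is a plain string; the topic name and console sink are placeholders:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession.builder().appName(appName).getOrCreate();

Dataset<Row> source = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaServers)
    .option("subscribe", "messages")              // placeholder topic
    .load();

// Buffer events for 1 minute per messageId, then concatenate the payloads
Dataset<Row> combined = source
    .selectExpr("CAST(key AS STRING) AS messageId",
                "CAST(value AS STRING) AS body",
                "timestamp")
    .withWatermark("timestamp", "1 minute")
    .groupBy(col("messageId"), window(col("timestamp"), "1 minute"))
    .agg(concat_ws("", collect_list(col("body"))).as("combinedMessage"));

StreamingQuery query = combined.writeStream()
    .outputMode("append")   // each group is emitted once its window closes
    .format("console")      // placeholder sink
    .start();
query.awaitTermination();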
