
Buffer messages in stream data for a given messageId

Use case: I have messages carrying a messageId, and multiple messages can share the same messageId. The messages flow through a streaming pipeline (like Kafka) partitioned by messageId, so I am making sure that all messages with the same messageId go to the same partition.
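For reference, a minimal sketch of the producer side being described; the topic name, broker address, and String serializers are placeholder assumptions. Using the messageId as the record key is what makes Kafka's default partitioner route all messages with the same id to the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// The default partitioner hashes the record key, so every record that
// shares a messageId lands in the same partition
producer.send(new ProducerRecord<>("messages", messageId, payload));
producer.close();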

So I need to write a job that buffers messages for some time (say 1 minute) and, after that time, combines all messages with the same messageId into a single large message.

I am thinking this can be done with Spark Datasets and Spark SQL (or something else?), but I could not find any example or documentation on how to hold on to messages for a given messageId for some time and then aggregate them.

I think what you're looking for is Spark Streaming. Spark has a Kafka connector that can plug into a Spark Streaming context.

Here's a really basic example that creates an RDD of all messages in a given set of topics over a 1-minute interval and then groups them by a messageId field (your value deserializer would have to expose such a getMessageId method, of course).

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

SparkConf conf = new SparkConf().setAppName(appName);
// 1-minute batches: each RDD holds one minute's worth of messages
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));

Map<String, Object> params = new HashMap<>();
params.put("bootstrap.servers", kafkaServers);
params.put("group.id", consumerGroupId);          // required by the Kafka consumer
params.put("key.deserializer", kafkaKeyDeserializer);
params.put("value.deserializer", kafkaValueDeserializer);

List<String> topics = Arrays.asList(/* add topics here */);

// Message stands for whatever type your value deserializer produces;
// it must expose a getMessageId() method
JavaInputDStream<ConsumerRecord<String, Message>> stream =
    KafkaUtils.createDirectStream(ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Message>Subscribe(topics, params));

// Group each one-minute batch by messageId (grouping alone is lazy;
// an action on the grouped RDD is needed to trigger processing)
stream.foreachRDD(rdd ->
    rdd.groupBy(record -> record.value().getMessageId())
       .foreach(group -> { /* combine group._2() into one large message */ }));

ssc.start();
ssc.awaitTermination();
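The groupBy above only defines the groups; to actually produce the single large message per messageId, one option (a sketch, not the only way) is mapToPair plus reduceByKey, assuming a hypothetical getBody() accessor on the same value type:

import scala.Tuple2;

stream.foreachRDD(rdd ->
    rdd.mapToPair(record ->
            new Tuple2<>(record.value().getMessageId(), record.value().getBody()))
       .reduceByKey((a, b) -> a + b)   // concatenate payloads per messageId
       .foreach(pair -> {
           // pair._1() = messageId, pair._2() = the combined large message;
           // write it to the job's output here
       }));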

There are several other ways to group the messages within the streaming API; look at the documentation for more examples.
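Since the question also mentions Datasets and Spark SQL: the same buffer-and-combine can be sketched with Structured Streaming's windowed aggregation instead of DStreams. This assumes the messageId travels as the Kafka record key and the payload is a plain string; the topic name and console sink are placeholders:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession.builder().appName(appName).getOrCreate();

Dataset<Row> source = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaServers)
    .option("subscribe", "messages")              // placeholder topic
    .load();

// Buffer events for 1 minute per messageId, then concatenate the payloads
Dataset<Row> combined = source
    .selectExpr("CAST(key AS STRING) AS messageId",
                "CAST(value AS STRING) AS body",
                "timestamp")
    .withWatermark("timestamp", "1 minute")
    .groupBy(col("messageId"), window(col("timestamp"), "1 minute"))
    .agg(concat_ws("", collect_list(col("body"))).as("combinedMessage"));

StreamingQuery query = combined.writeStream()
    .outputMode("append")   // each group is emitted once its window closes
    .format("console")      // placeholder sink
    .start();
query.awaitTermination();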
