Spark Streaming: read from Kafka and apply Spark SQL aggregations in Java
I have a Spark job that reads data from a database and applies Spark SQL aggregations to it. The code is as follows (only the conf options are omitted):
SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sqlContext = new SQLContext(sc);
Dataset<Row> df = MongoSpark.read(sqlContext).options(readOptions).load();
// registerTempTable is deprecated since Spark 2.0
df.createOrReplaceTempView("data");
df.cache();
aggregators = sqlContext.sql(myQuery);
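Now I want to create another job that reads messages from Kafka through Spark Streaming and then applies the same aggregations through Spark SQL. My code so far is as follows: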
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "192.168.99.100:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", KafkaStatisticsPayloadDeserializer.class);
kafkaParams.put("group.id", "Group1");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList(topic);
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
/*
 * Spark streaming context
 */
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
/*
 * Create a direct input DStream that receives data from Kafka
 */
JavaInputDStream<ConsumerRecord<String, StatisticsRecord>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, StatisticsRecord>Subscribe(topics, kafkaParams)
    );
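So far I have successfully read and deserialized the messages. My question is how to actually apply the Spark SQL aggregations to them. I tried the following, but it does not work. I think I somehow need to first isolate the "value" field, which contains the actual message.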
SQLContext sqlContext = new SQLContext(streamingContext.sparkContext());
stream.foreachRDD(rdd -> {
    Dataset<Row> df = sqlContext.createDataFrame(rdd.rdd(), StatisticsRecord.class);
    df.createOrReplaceTempView("data");
    df.cache();
    Dataset aggregators = sqlContext.sql(SQLContextAggregations.ORDER_TYPE_DB);
    aggregators.show();
});
You should obtain the context inside the function that is applied to the stream.
I solved the problem with the following code. Note that I now store the messages in JSON format rather than as actual objects.
SparkConf conf = new SparkConf().setAppName(topic).setMaster("local");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
SparkSession spark = SparkSession.builder().appName(topic).getOrCreate();
/*
* Kafka conf
*/
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", dbUri);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "Group4");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("Statistics");
/*
 * Create a direct input DStream that receives data from Kafka
 */
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
    );
/*
* Keep only the actual message in JSON format
*/
JavaDStream<String> recordStream = stream.map(record -> record.value());
/*
* Extract RDDs from stream and apply aggregation in each one
*/
recordStream.foreachRDD(rdd -> {
    // Skip empty micro-batches; isEmpty() avoids counting every record
    if (!rdd.isEmpty()) {
        Dataset<Row> df = spark.read().json(rdd.rdd());
        df.createOrReplaceTempView("data");
        df.cache();
        Dataset<Row> aggregators = spark.sql(SQLContextAggregations.ORDER_TYPE_DB);
        aggregators.show();
    }
});
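One thing worth keeping in mind with this approach: the temporary view `data` is replaced on every micro-batch, so the query only sees the records of the current 2-second batch, not everything received so far. The plain-Java sketch below (no Spark or Kafka dependency) models that behavior and shows how running totals would have to be merged separately. The `Message` class, the `BUY`/`SELL` payloads and the count-per-value aggregation are hypothetical stand-ins; the real query behind `SQLContextAggregations.ORDER_TYPE_DB` is not shown in the post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class PerBatchAggregation {

    // Hypothetical stand-in for a Kafka ConsumerRecord<String, String>:
    // the payload of interest is the value, not the key.
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Keep only the payload, as stream.map(record -> record.value()) does.
    static List<String> values(List<Message> batch) {
        return batch.stream().map(m -> m.value).collect(Collectors.toList());
    }

    // Count occurrences per value within ONE batch
    // (roughly what a GROUP BY ... COUNT(*) over the view would return).
    static Map<String, Long> countPerBatch(List<Message> batch) {
        return values(batch).stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Merge one batch result into totals that are kept across batches.
    static void mergeInto(Map<String, Long> totals, Map<String, Long> batchCounts) {
        batchCounts.forEach((k, v) -> totals.merge(k, v, Long::sum));
    }

    public static void main(String[] args) {
        List<List<Message>> batches = Arrays.asList(
                Arrays.asList(
                        new Message("k1", "BUY"),
                        new Message("k2", "SELL"),
                        new Message("k3", "BUY")),
                Arrays.asList(new Message("k4", "SELL")));

        Map<String, Long> totals = new TreeMap<>();
        for (List<Message> batch : batches) {
            Map<String, Long> counts = countPerBatch(batch);
            mergeInto(totals, counts);
            System.out.println("batch=" + counts + " running=" + totals);
        }
        // prints:
        // batch={BUY=2, SELL=1} running={BUY=2, SELL=1}
        // batch={SELL=1} running={BUY=2, SELL=2}
    }
}
```

If per-batch results are what you want, the `foreachRDD` code above is enough. If you need totals over the whole stream, the batch results have to be accumulated somewhere, as `mergeInto` does here in memory.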