
Structured Stream-Stream join is not happening in JSON from Kafka topic

The application listens to two Kafka topics:

  1. userevent

  2. paymentevent

Payload for userevent:

{"userId":"Id_223","firstname":"fname_223","lastname":"lname_223","phonenumber":"P98202384_223","usertimestamp":"Apr 5, 2019 2:58:47 PM"}

Payload for paymentevent:

{"paymentUserId":"Id_227","amount":1227.0,"location":"location_227","paymenttimestamp":"Apr 5, 2019 3:00:03 PM"}

The records need to be merged on userId = paymentUserId.

It seems the application is not able to parse the records from the Kafka topics.

There must be something about from_json that I am missing.

Can someone provide early feedback?

Here is the console output: the join never happens and no records are produced.

+------+---------+--------+-----------+-------------+-------------+------+--------+----------------+
|userId|firstname|lastname|phonenumber|usertimestamp|paymentuserId|amount|location|paymenttimestamp|
+------+---------+--------+-----------+-------------+-------------+------+--------+----------------+
+------+---------+--------+-----------+-------------+-------------+------+--------+----------------+

Here is the code.


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import static org.apache.spark.sql.functions.expr;

@SpringBootApplication
public class Stream2StreamJoin  implements CommandLineRunner{



    private static final Logger LOGGER =
              LoggerFactory.getLogger(Stream2StreamJoin.class);

    @Value("${kafka.bootstrap.server}")
    private String bootstrapServers;

    @Value("${kafka.userevent}")
    private String usereventTopic;

    @Value("${kafka.paymentevent}")
    private String paymenteventTopic;

    public void processData() {

        System.out.println(bootstrapServers);
        System.out.println(usereventTopic);
        System.out.println(paymenteventTopic);

        LOGGER.info(bootstrapServers);
        LOGGER.info(usereventTopic);
        LOGGER.info(paymenteventTopic);


        // Structured Streaming runs on a SparkSession; no JavaStreamingContext
        // or SparkConf is needed, so the master is set on the builder directly.
        SparkSession spark = SparkSession
                  .builder()
                  .appName("Stream2StreamJoin")
                  .master("local[*]")
                  .getOrCreate();

        spark.sparkContext().setLogLevel("ERROR");

        StructType userSchema =  DataTypes.createStructType(new StructField[] { 
                DataTypes.createStructField("userId", DataTypes.StringType, true),
                DataTypes.createStructField("firstname", DataTypes.StringType, true),
                DataTypes.createStructField("lastname", DataTypes.StringType, true),
                DataTypes.createStructField("phonenumber", DataTypes.StringType, true),
                DataTypes.createStructField("usertimestamp", DataTypes.TimestampType, true)
                });


        // JSON field names must match the payload exactly ("paymentUserId",
        // capital U), and amount arrives as a JSON number, so DoubleType fits
        // better than StringType.
        StructType paymentSchema =  DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("paymentUserId", DataTypes.StringType, true),
                DataTypes.createStructField("amount", DataTypes.DoubleType, true),
                DataTypes.createStructField("location", DataTypes.StringType, true),
                DataTypes.createStructField("paymenttimestamp", DataTypes.TimestampType, true)
                });



        // With the default JSON timestampFormat, from_json cannot parse
        // "Apr 5, 2019 2:58:47 PM"; usertimestamp comes back null, which is
        // what breaks the join (this turns out to be the root cause; see the
        // resolution below).
        Dataset<Row> userDataSet = spark.readStream().format("kafka")
                  .option("kafka.bootstrap.servers", bootstrapServers)
                  .option("subscribe", usereventTopic)
                  .option("startingOffsets", "earliest")
                  .load().selectExpr("CAST(value AS STRING) as userEvent")
                     .select(functions.from_json(functions.col("userEvent"), userSchema).as("user"))
                     .select("user.*");



        Dataset<Row> paymentDataSet = spark.readStream().format("kafka")
                  .option("kafka.bootstrap.servers", bootstrapServers)
                  .option("subscribe", paymenteventTopic)
                  .option("startingOffsets", "earliest")
                  .load().selectExpr("CAST(value AS STRING) as paymentEvent")
                     .select(functions.from_json(functions.col("paymentEvent"), paymentSchema).as("payment"))
                     .select("payment.*");

        Dataset<Row> userDataSetWithWatermark = userDataSet.withWatermark("usertimestamp", "2 hours");

        Dataset<Row> paymentDataSetWithWatermark = paymentDataSet.withWatermark("paymenttimestamp", "3 hours");

        Dataset<Row> joindataSet = userDataSetWithWatermark.join(
                paymentDataSetWithWatermark,
                  expr(
                          "userId = paymentUserId AND usertimestamp >= paymenttimestamp AND usertimestamp <= paymenttimestamp + interval 1 hour")
                );

        joindataSet.writeStream().format("console").start();
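
        // Diagnostic sketch (not in the original question): writing each
        // parsed stream to the console on its own makes the parse failure
        // visible -- with the Gson date layout, the timestamp columns are null.
        userDataSet.writeStream().format("console").start();
        paymentDataSet.writeStream().format("console").start();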



        try {
            spark.streams().awaitAnyTermination();
        } catch (StreamingQueryException e) {
            LOGGER.error("Streaming query terminated with error", e);
        }



    }

    @Override
    public void run(String... args) throws Exception {
        processData();

    }

    public static void main(String[] args) throws Exception {

        System.setProperty("hadoop.home.dir", "/Users/workspace/java/spark-kafka-streaming");

        SpringApplication.run(Stream2StreamJoin.class, args);
    }

}

Solved the problem by using the Jackson library in the event producer instead of the Google Gson library.
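
For completeness, here is a minimal sketch of the producer-side change; the EventSerializer helper below is an illustrative assumption, not the original producer code. The idea is to have Jackson write java.util.Date fields in an ISO-8601 layout that Spark's default JSON timestampFormat can parse.

import java.text.SimpleDateFormat;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical serializer used when publishing events to the topics.
public final class EventSerializer {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    static {
        // Emit dates as e.g. "2019-04-05T14:58:47.000+05:30" instead of
        // Gson's default "Apr 5, 2019 2:58:47 PM".
        MAPPER.setDateFormat(new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX"));
    }

    public static String toJson(Object event) throws JsonProcessingException {
        return MAPPER.writeValueAsString(event);
    }
}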

The underlying issue was on the consumer side: Spark could not make sense of the JSON objects received from the topics. Gson serializes java.util.Date as strings like "Apr 5, 2019 2:58:47 PM", which from_json cannot parse into a TimestampType with its default timestamp format, so the timestamp columns came back null and the join condition never matched.
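
Alternatively, had the producer stayed on Gson, the consumer could have told from_json about that date layout instead. A minimal sketch of the userDataSet definition above, assuming Spark 2.x, where from_json accepts the same options as the JSON data source and timestamps are parsed with SimpleDateFormat-style patterns:

import java.util.HashMap;
import java.util.Map;

// Pass an explicit timestampFormat matching "Apr 5, 2019 2:58:47 PM".
Map<String, String> jsonOptions = new HashMap<>();
jsonOptions.put("timestampFormat", "MMM d, yyyy h:mm:ss a");

Dataset<Row> userDataSet = spark.readStream().format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option("subscribe", usereventTopic)
        .option("startingOffsets", "earliest")
        .load().selectExpr("CAST(value AS STRING) as userEvent")
        // from_json(Column, StructType, Map<String, String>) forwards the
        // options to the JSON parser, including timestampFormat.
        .select(functions.from_json(functions.col("userEvent"), userSchema, jsonOptions).as("user"))
        .select("user.*");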

~Keep Learning Keep Growing
