
How to read stream nested JSON from kafka in Spark using Java

I'm trying to read complex nested JSON data from Kafka in Spark using Java and am having trouble forming the Dataset.

Actual JSON records sent to Kafka:

{"sample_title": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title2": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title3": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
Dataset<Row> df = spark.readStream().format("kafka")
                    .option("spark.local.dir", config.getString(PropertyKeys.SPARK_APPLICATION_TEMP_LOCATION.getCode()))
                    .option("kafka.bootstrap.servers",
                            config.getString(PropertyKeys.KAFKA_BOORTSTRAP_SERVERS.getCode()))
                    .option("subscribe", config.getString(PropertyKeys.KAFKA_TOPIC_IPE_STP.getCode()))
                    .option("startingOffsets", "earliest")
                    .option("spark.default.parallelism",
                            config.getInt(PropertyKeys.SPARK_APPLICATION_DEFAULT_PARALLELISM_VALUE.getCode()))
                    .option("spark.sql.shuffle.partitions",
                            config.getInt(PropertyKeys.SPARK_APPLICATION_SHUFFLE_PARTITIONS_COUNT.getCode()))
                    .option("kafka.security.protocol", config.getString(PropertyKeys.SECURITY_PROTOCOL.getCode()))
                    .option("kafka.ssl.truststore.location",
                            config.getString(PropertyKeys.SSL_TRUSTSTORE_LOCATION.getCode()))
                    .option("kafka.ssl.truststore.password",
                            config.getString(PropertyKeys.SSL_TRUSTSTORE_PASSWORD.getCode()))
                    .option("kafka.ssl.keystore.location",
                            config.getString(PropertyKeys.SSL_KEYSTORE_LOCATION.getCode()))
                    .option("kafka.ssl.keystore.password",
                            config.getString(PropertyKeys.SSL_KEYSTORE_PASSWORD.getCode()))
                    .option("kafka.ssl.key.password", config.getString(PropertyKeys.SSL_KEY_PASSWORD.getCode())).load()
                    .selectExpr("CAST(key AS STRING)",
                            "CAST(value AS STRING)",
                            "topic as topic",
                            "partition as partition","offset as offset",
                            "timestamp as timestamp",
                            "timestampType as timestampType");

Dataset<String> output = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING())
        .filter((FilterFunction<String>) x -> x.contains("sample_title"));

Since the input can contain multiple schemas, the code should be able to handle that, filter according to the title, and map the records to a Dataset of type Title.

public class Title implements Serializable {
    String txn_date;
    Timestamp timestamp;
    String txn_type;
    String txn_rcvd_time;
    String txn_ref;
    String txn_status;
}

First make the class Title a Java bean class, i.e. write getters and setters.

    public class Title implements Serializable {
        String txn_date;
        Timestamp timestamp;
        String txn_type;
        String txn_rcvd_time;
        String txn_ref;
        String txn_status;
        public Title(String data){... //set values for fields with the data}
        // add all getters and setters for fields
    }
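A minimal sketch of such a bean, assuming Jackson's ObjectMapper is on the classpath and that each incoming value is a single-key record like the samples above (the parsing logic here is illustrative):

    import java.io.Serializable;
    import java.sql.Timestamp;
    import java.time.Instant;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class Title implements Serializable {
        private String txn_date;
        private Timestamp timestamp;
        private String txn_type;
        private String txn_rcvd_time;
        private String txn_ref;
        private String txn_status;

        public Title() { }  // no-arg constructor required by Encoders.bean

        public Title(String data) {
            try {
                ObjectMapper mapper = new ObjectMapper();
                // The payload has a single varying top-level key ("sample_title", ...),
                // so take its only value as the nested object.
                JsonNode inner = mapper.readTree(data).elements().next();
                this.txn_date = inner.path("txn_date").asText(null);
                this.timestamp = Timestamp.from(Instant.parse(inner.path("timestamp").asText()));
                this.txn_type = inner.path("txn_type").asText(null);
                this.txn_rcvd_time = inner.path("txn_rcvd_time").asText(null);
                this.txn_ref = inner.path("txn_ref").asText(null);
                this.txn_status = inner.path("txn_status").asText(null);
            } catch (Exception e) {
                throw new RuntimeException("Failed to parse record: " + data, e);
            }
        }

        public String getTxn_date() { return txn_date; }
        public void setTxn_date(String txn_date) { this.txn_date = txn_date; }
        public Timestamp getTimestamp() { return timestamp; }
        public void setTimestamp(Timestamp timestamp) { this.timestamp = timestamp; }
        public String getTxn_type() { return txn_type; }
        public void setTxn_type(String txn_type) { this.txn_type = txn_type; }
        public String getTxn_rcvd_time() { return txn_rcvd_time; }
        public void setTxn_rcvd_time(String txn_rcvd_time) { this.txn_rcvd_time = txn_rcvd_time; }
        public String getTxn_ref() { return txn_ref; }
        public void setTxn_ref(String txn_ref) { this.txn_ref = txn_ref; }
        public String getTxn_status() { return txn_status; }
        public void setTxn_status(String txn_status) { this.txn_status = txn_status; }
    }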

    Dataset<Title> resultdf = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING())
            .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
    resultdf.filter((FilterFunction<Title>) title -> /* apply any predicate on title */ true);

If you want to filter the data first and then apply the encoding:

    df.selectExpr("CAST(value AS STRING)")
            // requires: import static org.apache.spark.sql.functions.*;
            .filter(get_json_object(col("value"), "$.sample_title").isNotNull())
            // for a simple string filter use: .as(Encoders.STRING()).filter(t -> t.contains("sample_title"))
            .as(Encoders.STRING())
            .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
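Either way, the resulting Dataset<Title> (resultdf above) still has to be attached to a sink and the streaming query started; a minimal sketch using the console sink, where the checkpoint location is only an illustrative path:

    import org.apache.spark.sql.streaming.StreamingQuery;

    try {
        StreamingQuery query = resultdf.writeStream()
                .format("console")                 // illustrative sink for inspecting the stream
                .outputMode("append")
                .option("checkpointLocation", "/tmp/title-checkpoint")  // illustrative path
                .start();
        query.awaitTermination();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }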

