
How to read stream nested JSON from kafka in Spark using Java

I am trying to read complex nested JSON data from Kafka in Spark using Java, and I am unable to form a Dataset from it.

The actual JSON messages sent to Kafka:

{"sample_title": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title2": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title3": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
Dataset<Row> df = spark.readStream().format("kafka")
                    .option("spark.local.dir", config.getString(PropertyKeys.SPARK_APPLICATION_TEMP_LOCATION.getCode()))
                    .option("kafka.bootstrap.servers",
                            config.getString(PropertyKeys.KAFKA_BOORTSTRAP_SERVERS.getCode()))
                    .option("subscribe", config.getString(PropertyKeys.KAFKA_TOPIC_IPE_STP.getCode()))
                    .option("startingOffsets", "earliest")
                    .option("spark.default.parallelism",
                            config.getInt(PropertyKeys.SPARK_APPLICATION_DEFAULT_PARALLELISM_VALUE.getCode()))
                    .option("spark.sql.shuffle.partitions",
                            config.getInt(PropertyKeys.SPARK_APPLICATION_SHUFFLE_PARTITIONS_COUNT.getCode()))
                    .option("kafka.security.protocol", config.getString(PropertyKeys.SECURITY_PROTOCOL.getCode()))
                    .option("kafka.ssl.truststore.location",
                            config.getString(PropertyKeys.SSL_TRUSTSTORE_LOCATION.getCode()))
                    .option("kafka.ssl.truststore.password",
                            config.getString(PropertyKeys.SSL_TRUSTSTORE_PASSWORD.getCode()))
                    .option("kafka.ssl.keystore.location",
                            config.getString(PropertyKeys.SSL_KEYSTORE_LOCATION.getCode()))
                    .option("kafka.ssl.keystore.password",
                            config.getString(PropertyKeys.SSL_KEYSTORE_PASSWORD.getCode()))
                    .option("kafka.ssl.key.password", config.getString(PropertyKeys.SSL_KEY_PASSWORD.getCode())).load()
                    .selectExpr("CAST(key AS STRING)",
                            "CAST(value AS STRING)",
                            "topic as topic",
                            "partition as partition","offset as offset",
                            "timestamp as timestamp",
                            "timestampType as timestampType");

Dataset<String> output = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING())
        .filter((FilterFunction<String>) x -> x.contains("sample_title"));

Since the input can contain multiple schemas, the code should be able to handle them, filter on the title key, and map the result to a Dataset of type Title (one possible way of dealing with the varying title key is sketched after the class below).

public class Title implements Serializable {
    String txn_date;
    Timestamp timestamp;
    String txn_type;
    String txn_rcvd_time;
    String txn_ref;
    String txn_status;
}
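
For reference, one possible way to cope with the top-level key changing from message to message is to parse each value as a JSON map keyed by the title name. The sketch below is only an illustration under that assumption: it uses Spark SQL's from_json/explode functions with a hand-written schema mirroring the fields above, and the names titleSchema, parsed and title_key are illustrative, not from the original post.

    // Sketch only. Assumed imports:
    //   import static org.apache.spark.sql.functions.*;
    //   import org.apache.spark.sql.types.DataTypes;
    //   import org.apache.spark.sql.types.StructType;
    StructType titleSchema = new StructType()
            .add("txn_date", DataTypes.StringType)
            .add("timestamp", DataTypes.TimestampType)
            .add("txn_type", DataTypes.StringType)
            .add("txn_rcvd_time", DataTypes.StringType)
            .add("txn_ref", DataTypes.StringType)
            .add("txn_status", DataTypes.StringType);

    // Parse each Kafka value as a map<title key, Title struct>, so the varying
    // top-level key ("sample_title", "sample_title2", ...) does not have to be
    // known in advance, then explode the map and flatten the nested struct.
    Dataset<Row> parsed = df
            .select(from_json(col("value"),
                    DataTypes.createMapType(DataTypes.StringType, titleSchema)).as("m"))
            .select(explode(col("m")).as(new String[] {"title_key", "title"}))
            .selectExpr("title_key", "title.*");

From parsed you could then filter on title_key and, once Title is made a proper Java bean as described below, convert to a typed Dataset with parsed.drop("title_key").as(Encoders.bean(Title.class)).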

First, make the Title class a Java bean, i.e. write getters and setters for it.

    public class Title implements Serializable {
        String txn_date;
        Timestamp timestamp;
        String txn_type;
        String txn_rcvd_time;
        String txn_ref;
        String txn_status;

        // a public no-arg constructor is also needed for Encoders.bean(Title.class)
        public Title() {}

        public Title(String data) {
            // ... set values for the fields from the raw JSON string
        }

        // add all getters and setters for the fields
    }
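
The constructor body is left open in the snippet above. As one illustration only (it assumes Jackson's ObjectMapper is available on the classpath, which the original post does not mention, and the field names follow the sample JSON), it could pick out the single nested object regardless of its top-level key:

    // Illustrative only: parse the Kafka value with Jackson and copy the nested fields.
    // Assumes the message looks like {"<some title key>": { ...fields... }} as in the samples.
    public Title(String data) {
        try {
            com.fasterxml.jackson.databind.JsonNode root =
                    new com.fasterxml.jackson.databind.ObjectMapper().readTree(data);
            // take the single nested object, whatever its top-level key is
            com.fasterxml.jackson.databind.JsonNode node = root.elements().next();
            this.txn_date = node.get("txn_date").asText();
            this.timestamp = java.sql.Timestamp.from(
                    java.time.Instant.parse(node.get("timestamp").asText()));
            this.txn_type = node.get("txn_type").asText();
            this.txn_rcvd_time = node.get("txn_rcvd_time").asText();
            this.txn_ref = node.get("txn_ref").asText();
            this.txn_status = node.get("txn_status").asText();
        } catch (java.io.IOException e) {
            throw new RuntimeException("Unable to parse Kafka value as Title JSON: " + data, e);
        }
    }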

    // MapFunction / FilterFunction are in org.apache.spark.api.java.function
    Dataset<Title> resultdf = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING())
            .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
    resultdf.filter((FilterFunction<Title>) title -> /* apply any predicate on Title */ true);

If you want to filter the data first and then apply the encoding:

    // requires a static import of org.apache.spark.sql.functions.*
    df.selectExpr("CAST(value AS STRING)")
            .filter(get_json_object(col("value"), "$.sample_title").isNotNull())
            // for a simple string filter use: .filter((FilterFunction<String>) t -> t.contains("sample_title"))
            .as(Encoders.STRING())
            .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
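
Either way, df comes from readStream, so the result is a streaming Dataset and nothing is consumed from Kafka until a streaming query is started. A minimal sketch of that last step, using resultdf from the first snippet (the console sink and output mode are placeholders, not from the original post):

    // Assumed import: org.apache.spark.sql.streaming.StreamingQuery
    // Start the streaming query; sink, output mode and trigger are placeholders.
    StreamingQuery query = resultdf
            .writeStream()
            .outputMode("append")
            .format("console")
            .start();
    query.awaitTermination(); // throws StreamingQueryException; declare or handle it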

