How to read stream nested JSON from kafka in Spark using Java
I'm trying to read complex nested JSON data from Kafka in Spark using Java, and I'm having trouble forming the Dataset.
Actual JSON sent to Kafka:
{"sample_title": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title2": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title3": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
Dataset<Row> df = spark.readStream().format("kafka")
        .option("spark.local.dir", config.getString(PropertyKeys.SPARK_APPLICATION_TEMP_LOCATION.getCode()))
        .option("kafka.bootstrap.servers",
                config.getString(PropertyKeys.KAFKA_BOORTSTRAP_SERVERS.getCode()))
        .option("subscribe", config.getString(PropertyKeys.KAFKA_TOPIC_IPE_STP.getCode()))
        .option("startingOffsets", "earliest")
        .option("spark.default.parallelism",
                config.getInt(PropertyKeys.SPARK_APPLICATION_DEFAULT_PARALLELISM_VALUE.getCode()))
        .option("spark.sql.shuffle.partitions",
                config.getInt(PropertyKeys.SPARK_APPLICATION_SHUFFLE_PARTITIONS_COUNT.getCode()))
        .option("kafka.security.protocol", config.getString(PropertyKeys.SECURITY_PROTOCOL.getCode()))
        .option("kafka.ssl.truststore.location",
                config.getString(PropertyKeys.SSL_TRUSTSTORE_LOCATION.getCode()))
        .option("kafka.ssl.truststore.password",
                config.getString(PropertyKeys.SSL_TRUSTSTORE_PASSWORD.getCode()))
        .option("kafka.ssl.keystore.location",
                config.getString(PropertyKeys.SSL_KEYSTORE_LOCATION.getCode()))
        .option("kafka.ssl.keystore.password",
                config.getString(PropertyKeys.SSL_KEYSTORE_PASSWORD.getCode()))
        .option("kafka.ssl.key.password", config.getString(PropertyKeys.SSL_KEY_PASSWORD.getCode()))
        .load()
        .selectExpr("CAST(key AS STRING)",
                "CAST(value AS STRING)",
                "topic as topic",
                "partition as partition",
                "offset as offset",
                "timestamp as timestamp",
                "timestampType as timestampType");
Dataset<String> output = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING())
        .filter((FilterFunction<String>) x -> x.contains("sample_title"));
Since the input can contain multiple schemas, the code should be able to handle that, filter by the title, and map each record to a Dataset of type Title.
public class Title implements Serializable {
String txn_date;
Timestamp timestamp;
String txn_type;
String txn_rcvd_time;
String txn_ref;
String txn_status;
}
First, make the class Title a Java bean class, i.e. write getters and setters.
public class Title implements Serializable {
    String txn_date;
    Timestamp timestamp;
    String txn_type;
    String txn_rcvd_time;
    String txn_ref;
    String txn_status;

    public Title(String data) {
        // ... set values for the fields from data
    }

    // add all getters and setters for the fields
}
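The constructor left elided above can be sketched in plain Java. This is a hypothetical minimal version: the regex-based `field` helper is a stand-in for a real JSON parser (e.g. Jackson, which is already on Spark's classpath), and `timestamp` is kept as a `String` here rather than the `java.sql.Timestamp` used in the question:

```java
import java.io.Serializable;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: unwraps the nested object under the outer key
// ("sample_title", "sample_title2", ...) by extracting each field with a
// regex. A production version would parse the JSON with a real library.
class Title implements Serializable {
    String txn_date;
    String timestamp;   // kept as String for this sketch
    String txn_type;
    String txn_rcvd_time;
    String txn_ref;
    String txn_status;

    public Title(String data) {
        this.txn_date = field(data, "txn_date");
        this.timestamp = field(data, "timestamp");
        this.txn_type = field(data, "txn_type");
        this.txn_rcvd_time = field(data, "txn_rcvd_time");
        this.txn_ref = field(data, "txn_ref");
        this.txn_status = field(data, "txn_status");
    }

    // Extracts "name": "value" from the JSON string; null if absent.
    private static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Bean getters so Encoders.bean(Title.class) can see the fields;
    // remaining getters and setters omitted for brevity.
    public String getTxn_date() { return txn_date; }
    public String getTxn_type() { return txn_type; }
}
```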
Dataset<Title> resultdf = df.selectExpr("CAST(value AS STRING)")
        .as(Encoders.STRING())
        .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
resultdf.filter((FilterFunction<Title>) title -> /* apply any predicate on title */ true);
If you want to filter the data first and then apply the encoding:
df.selectExpr("CAST(value AS STRING)")
        // requires: import static org.apache.spark.sql.functions.*;
        .filter(get_json_object(col("value"), "$.sample_title").isNotNull())
        // or, as a simple string filter:
        // .as(Encoders.STRING()).filter((FilterFunction<String>) t -> t.contains("sample_title"))
        .as(Encoders.STRING())
        .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
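Because several titles (`sample_title`, `sample_title2`, ...) can arrive on the same topic, one way to route a record to the right bean before encoding is to inspect its outer JSON key. A hypothetical plain-Java helper (no Spark dependency) sketching that idea:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical routing helper: each Kafka record is a one-key JSON object,
// so the outer key identifies which schema the nested payload follows.
class TitleRouter {
    private static final Pattern OUTER_KEY = Pattern.compile("^\\s*\\{\\s*\"([^\"]+)\"");

    // Returns the outer key of a one-key JSON object, or null if none found.
    static String outerKey(String json) {
        Matcher m = OUTER_KEY.matcher(json);
        return m.find() ? m.group(1) : null;
    }
}
```

A streaming job could call `outerKey` inside the filter/map lambdas to dispatch each record to the bean and `Encoders.bean(...)` matching its schema.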