Extracting nested JSON values in Spark Streaming Java
How should I parse JSON messages coming from Kafka in Spark Streaming? I convert the JavaRDD to a Dataset and extract values from there. Extracting top-level values works, but I cannot extract nested JSON values such as "host.name" and "fields.type".
Incoming message from Kafka:
{
  "@timestamp": "2020-03-03T10:48:03.160Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.6.0"
  },
  "host": {
    "name": "test.com"
  },
  "agent": {
    "id": "7651453414",
    "version": "7.6.0",
    "type": "filebeat",
    "ephemeral_id": "71983698531",
    "hostname": "test"
  },
  "message": "testing",
  "log": {
    "file": {
      "path": "/test.log"
    },
    "offset": 250553
  },
  "input": {
    "type": "log"
  },
  "fields": {
    "type": "test"
  },
  "ecs": {
    "version": "1.4.0"
  }
}
Spark code:
// Schema for the raw Kafka payload: a single string column holding the JSON text.
StructField[] structFields = new StructField[] {
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType = new StructType(structFields);

// Target schema for from_json; host and fields are declared as plain strings here.
StructField[] structFields2 = new StructField[] {
        new StructField("host", DataTypes.StringType, true, Metadata.empty()),
        new StructField("fields", DataTypes.StringType, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType2 = new StructType(structFields2);

// Wrap each Kafka record's value in a Row so the RDD can become a DataFrame.
JavaRDD<Row> rowRDD = rdd.map(new Function<ConsumerRecord<String, String>, Row>() {
    private static final long serialVersionUID = -8817714250698168398L;

    @Override
    public Row call(ConsumerRecord<String, String> r) {
        return RowFactory.create(r.value());
    }
});

// Parse the JSON column and flatten the result.
Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structType2).as("data"))
        .select("data.*");
rowExtracted.printSchema();
rowExtracted.show((int) rowExtracted.count(), false);
Printed schema:
root
|-- host: string (nullable = true)
|-- fields: string (nullable = true)
|-- message: string (nullable = true)
Actual output:
+---------------+---------------+-------+
|host |fields |message|
+---------------+---------------+-------+
|{"name":"test"}|{"type":"test"}|testing|
+---------------+---------------+-------+
Expected output:
+---------------+---------------+-------+
|host |fields |message|
+---------------+---------------+-------+
|test |test |testing|
+---------------+---------------+-------+
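The actual output keeps host and fields as raw JSON strings because structType2 declares them as StringType. The fix is to describe the nesting in the schema itself, so that from_json parses host and fields into structs whose members can be selected with dot paths: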
StructField[] structFieldsName = new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeName = new StructType(structFieldsName);

StructField[] structFieldsType = new StructField[] {
        new StructField("type", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeNested = new StructType(structFieldsType);

// host and fields are now struct-typed columns instead of plain strings.
StructField[] structFieldsMsg = new StructField[] {
        new StructField("host", structTypeName, true, Metadata.empty()),
        new StructField("fields", structTypeNested, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeMsg = new StructType(structFieldsMsg);

// createDataFrame still takes the single-column raw schema (structType);
// the nested schema is what from_json parses the payload with.
Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structTypeMsg).as("data"))
        .select(functions.col("data.host.name").as("host"),
                functions.col("data.fields.type").as("fields"),
                functions.col("data.message").as("message"));
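As a sanity check, selecting "data.*" instead of the individual paths makes printSchema show host and fields as structs under the nested schema:

root
 |-- host: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- type: string (nullable = true)
 |-- message: string (nullable = true)

If maintaining a nested schema is not worth it, Spark's get_json_object can pull individual values out by JSON path instead; this is a minimal sketch against the same raw DataFrame (the byPath name is just illustrative):

// Alternative: extract nested values by '$'-rooted JSON path, no nested schema needed.
Dataset<Row> byPath = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.get_json_object(functions.col("message"), "$.host.name").as("host"),
                functions.get_json_object(functions.col("message"), "$.fields.type").as("fields"),
                functions.get_json_object(functions.col("message"), "$.message").as("message"));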