
Extracting nested JSON values in Spark Streaming Java

How should I parse JSON messages from Kafka in Spark Streaming? I'm converting a JavaRDD to a Dataset and extracting the values from there. I can extract top-level values successfully, but not nested JSON values such as "host.name" and "fields.type".

Incoming message from Kafka:

{
  "@timestamp": "2020-03-03T10:48:03.160Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.6.0"
  },
  "host": {
    "name": "test.com"
  },
  "agent": {
    "id": "7651453414",
    "version": "7.6.0",
    "type": "filebeat",
    "ephemeral_id": "71983698531",
    "hostname": "test"
  },
  "message": "testing",
  "log": {
    "file": {
      "path": "/test.log"
    },
    "offset": 250553
  },
  "input": {
    "type": "log"
  },
  "fields": {
    "type": "test"
  },
  "ecs": {
    "version": "1.4.0"
  }
}
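As a side note, for a payload like the one above you do not have to write the matching schema by hand: Spark can infer a `StructType` (including the nested structs) from a sample document. A minimal sketch, assuming a local `SparkSession` (the class name `SchemaInferDemo` is only illustrative):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaInferDemo {

    // Let Spark infer a StructType (including nested structs) from a sample payload.
    public static StructType inferSchema(SparkSession spark, String sampleJson) {
        Dataset<String> ds = spark.createDataset(
                Collections.singletonList(sampleJson), Encoders.STRING());
        return spark.read().json(ds).schema();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("SchemaInferDemo").getOrCreate();
        String sample = "{\"host\":{\"name\":\"test.com\"},"
                + "\"fields\":{\"type\":\"test\"},\"message\":\"testing\"}";
        // Prints the inferred schema tree, with host and fields as nested structs.
        System.out.println(inferSchema(spark, sample).treeString());
        spark.stop();
    }
}
```

The inferred schema can then be passed directly to `from_json`, which avoids keeping a hand-written schema in sync with the Filebeat document shape.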

Spark code:

StructField[] structFields = new StructField[] {
            new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType = new StructType(structFields);

StructField[] structFields2 = new StructField[] {
            new StructField("host", DataTypes.StringType, true, Metadata.empty()),
            new StructField("fields", DataTypes.StringType, true, Metadata.empty()),
            new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType2 = new StructType(structFields2);

JavaRDD<Row> rowRDD = rdd.map(new Function<ConsumerRecord<String, String>, Row>() {
    private static final long serialVersionUID = -8817714250698168398L;

    @Override
    public Row call(ConsumerRecord<String, String> r) {
        return RowFactory.create(r.value());
    }
});

Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structType2).as("data"))
        .select("data.*");
rowExtracted.printSchema();
rowExtracted.show((int) rowExtracted.count(), false);

printSchema output:

root
 |-- host: string (nullable = true)
 |-- fields: string (nullable = true)
 |-- message: string (nullable = true)

Actual output:

+---------------+---------------+-------+
|host           |fields         |message|
+---------------+---------------+-------+
|{"name":"test"}|{"type":"test"}|testing|
+---------------+---------------+-------+

Expected output:

+---------------+---------------+-------+
|host           |fields         |message|
+---------------+---------------+-------+
|test           |test           |testing|
+---------------+---------------+-------+
The nested values come back as raw JSON strings because host and fields are declared as StringType in the schema. Declaring them as nested StructTypes and then selecting the inner fields resolves this:

StructField[] structFieldsName = new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeName = new StructType(structFieldsName);

StructField[] structFieldsType = new StructField[] {
        new StructField("type", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeNested = new StructType(structFieldsType);

StructField[] structFieldsMsg = new StructField[] {
        new StructField("host", structTypeName, true, Metadata.empty()),
        new StructField("fields", structTypeNested, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeMsg = new StructType(structFieldsMsg);

Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structTypeMsg).as("data"))
        .select(functions.col("data.host.name").as("host"),
                functions.col("data.fields.type").as("fields"),
                functions.col("data.message").as("message"));
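Alternatively, when you only need a handful of nested fields, Spark's built-in `get_json_object` can extract them by JSONPath without defining any schema at all. A minimal self-contained sketch, not the exact pipeline above (the class name `NestedJsonDemo` and the local `SparkSession` are illustrative assumptions):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;

public class NestedJsonDemo {

    // Extract host.name, fields.type and message from one JSON payload
    // using JSONPath expressions; no StructType schema is required.
    public static Row extract(SparkSession spark, String json) {
        // A Dataset<String> exposes its content as a single column named "value".
        Dataset<String> raw = spark.createDataset(
                Collections.singletonList(json), Encoders.STRING());
        return raw.select(
                get_json_object(col("value"), "$.host.name").as("host"),
                get_json_object(col("value"), "$.fields.type").as("fields"),
                get_json_object(col("value"), "$.message").as("message"))
            .first();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("NestedJsonDemo").getOrCreate();
        String json = "{\"host\":{\"name\":\"test.com\"},"
                + "\"fields\":{\"type\":\"test\"},\"message\":\"testing\"}";
        System.out.println(extract(spark, json));
        spark.stop();
    }
}
```

The trade-off is that `get_json_object` returns every field as a string and parses the JSON once per call, so the explicit-schema approach with `from_json` is preferable when many fields are needed or when types matter.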
