Extracting nested JSON values in Spark Streaming Java
How should I parse JSON messages coming from Kafka in Spark Streaming? I convert the JavaRDD to a Dataset and extract values from there. Extracting top-level values works, but I cannot extract nested JSON values such as "host.name" and "fields.type".
Incoming message from Kafka:
{
  "@timestamp": "2020-03-03T10:48:03.160Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.6.0"
  },
  "host": {
    "name": "test.com"
  },
  "agent": {
    "id": "7651453414",
    "version": "7.6.0",
    "type": "filebeat",
    "ephemeral_id": "71983698531",
    "hostname": "test"
  },
  "message": "testing",
  "log": {
    "file": {
      "path": "/test.log"
    },
    "offset": 250553
  },
  "input": {
    "type": "log"
  },
  "fields": {
    "type": "test"
  },
  "ecs": {
    "version": "1.4.0"
  }
}
Spark code:
// Schema for the raw Kafka payload: a single string column holding the JSON text.
StructField[] structFields = new StructField[] {
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType = new StructType(structFields);

// Target schema for from_json; host and fields are declared as plain strings here.
StructField[] structFields2 = new StructField[] {
        new StructField("host", DataTypes.StringType, true, Metadata.empty()),
        new StructField("fields", DataTypes.StringType, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType2 = new StructType(structFields2);

// Wrap each Kafka record's value in a Row so the RDD can become a DataFrame.
JavaRDD<Row> rowRDD = rdd.map(new Function<ConsumerRecord<String, String>, Row>() {
    private static final long serialVersionUID = -8817714250698168398L;

    @Override
    public Row call(ConsumerRecord<String, String> r) {
        return RowFactory.create(r.value());
    }
});

// Parse the JSON column and flatten the result.
Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structType2).as("data"))
        .select("data.*");
rowExtracted.printSchema();
rowExtracted.show((int) rowExtracted.count(), false);
Printed schema:
root
|-- host: string (nullable = true)
|-- fields: string (nullable = true)
|-- message: string (nullable = true)
Actual output:
+---------------+---------------+-------+
|host |fields |message|
+---------------+---------------+-------+
|{"name":"test"}|{"type":"test"}|testing|
+---------------+---------------+-------+
Expected output:
+---------------+---------------+-------+
|host |fields |message|
+---------------+---------------+-------+
|test |test |testing|
+---------------+---------------+-------+
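The actual output keeps host and fields as raw JSON strings because structType2 declares them as StringType. The fix is to describe the nesting in the schema itself, so that from_json parses host and fields into structs whose members can be selected with dot paths: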
StructField[] structFieldsName = new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeName = new StructType(structFieldsName);

StructField[] structFieldsType = new StructField[] {
        new StructField("type", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeNested = new StructType(structFieldsType);

// host and fields are now struct-typed columns instead of plain strings.
StructField[] structFieldsMsg = new StructField[] {
        new StructField("host", structTypeName, true, Metadata.empty()),
        new StructField("fields", structTypeNested, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty())
};
StructType structTypeMsg = new StructType(structFieldsMsg);

// createDataFrame still takes the single-column raw schema (structType);
// the nested schema is what from_json parses the payload with.
Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structTypeMsg).as("data"))
        .select(functions.col("data.host.name").as("host"),
                functions.col("data.fields.type").as("fields"),
                functions.col("data.message").as("message"));
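As a sanity check, selecting "data.*" instead of the individual paths makes printSchema show host and fields as structs under the nested schema:

root
 |-- host: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- type: string (nullable = true)
 |-- message: string (nullable = true)

If maintaining a nested schema is not worth it, Spark's get_json_object can pull individual values out by JSON path instead; this is a minimal sketch against the same raw DataFrame (the byPath name is just illustrative):

// Alternative: extract nested values by '$'-rooted JSON path, no nested schema needed.
Dataset<Row> byPath = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.get_json_object(functions.col("message"), "$.host.name").as("host"),
                functions.get_json_object(functions.col("message"), "$.fields.type").as("fields"),
                functions.get_json_object(functions.col("message"), "$.message").as("message"));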