
Extracting nested JSON values in Spark Streaming Java

How should I parse JSON messages from Kafka in Spark Streaming? I'm converting a JavaRDD to a Dataset and extracting the values from there. I can extract top-level values successfully, but not nested JSON values such as "host.name" and "fields.type".

Incoming message from Kafka:

{
  "@timestamp": "2020-03-03T10:48:03.160Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.6.0"
  },
  "host": {
    "name": "test.com"
  },
  "agent": {
    "id": "7651453414",
    "version": "7.6.0",
    "type": "filebeat",
    "ephemeral_id": "71983698531",
    "hostname": "test"
  },
  "message": "testing",
  "log": {
    "file": {
      "path": "/test.log"
    },
    "offset": 250553
  },
  "input": {
    "type": "log"
  },
  "fields": {
    "type": "test"
  },
  "ecs": {
    "version": "1.4.0"
  }
}
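As a side note, for a payload like the one above you do not have to write the matching schema by hand: Spark can infer a `StructType` (including the nested structs) from a sample document. A minimal sketch, assuming a local `SparkSession` (the class name `SchemaInferDemo` is only illustrative):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaInferDemo {

    // Let Spark infer a StructType (including nested structs) from a sample payload.
    public static StructType inferSchema(SparkSession spark, String sampleJson) {
        Dataset<String> ds = spark.createDataset(
                Collections.singletonList(sampleJson), Encoders.STRING());
        return spark.read().json(ds).schema();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("SchemaInferDemo").getOrCreate();
        String sample = "{\"host\":{\"name\":\"test.com\"},"
                + "\"fields\":{\"type\":\"test\"},\"message\":\"testing\"}";
        // Prints the inferred schema tree, with host and fields as nested structs.
        System.out.println(inferSchema(spark, sample).treeString());
        spark.stop();
    }
}
```

The inferred schema can then be passed directly to `from_json`, which avoids keeping a hand-written schema in sync with the Filebeat document shape.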

Spark code:

StructField[] structFields = new StructField[] {
            new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType = new StructType(structFields);

StructField[] structFields2 = new StructField[] {
            new StructField("host", DataTypes.StringType, true, Metadata.empty()),
            new StructField("fields", DataTypes.StringType, true, Metadata.empty()),
            new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structType2 = new StructType(structFields2);

JavaRDD<Row> rowRDD = rdd.map(new Function<ConsumerRecord<String, String>, Row>() {
    private static final long serialVersionUID = -8817714250698168398L;

    @Override
    public Row call(ConsumerRecord<String, String> r) {
        return RowFactory.create(r.value());
    }
});

Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structType2).as("data"))
        .select("data.*");
rowExtracted.printSchema();
rowExtracted.show((int) rowExtracted.count(), false);

printSchema output:

root
 |-- host: string (nullable = true)
 |-- fields: string (nullable = true)
 |-- message: string (nullable = true)

Actual output:

+---------------+---------------+-------+
|host           |fields         |message|
+---------------+---------------+-------+
|{"name":"test"}|{"type":"test"}|testing|
+---------------+---------------+-------+

Expected output:

+---------------+---------------+-------+
|host           |fields         |message|
+---------------+---------------+-------+
|test           |test           |testing|
+---------------+---------------+-------+
The nested values come back as raw JSON strings because host and fields are declared as StringType in the schema. Declaring them as nested StructTypes and then selecting the inner fields resolves this:

StructField[] structFieldsName = new StructField[] {
        new StructField("name", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeName = new StructType(structFieldsName);

StructField[] structFieldsType = new StructField[] {
        new StructField("type", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeNested = new StructType(structFieldsType);

StructField[] structFieldsMsg = new StructField[] {
        new StructField("host", structTypeName, true, Metadata.empty()),
        new StructField("fields", structTypeNested, true, Metadata.empty()),
        new StructField("message", DataTypes.StringType, true, Metadata.empty()) };
StructType structTypeMsg = new StructType(structFieldsMsg);

Dataset<Row> rowExtracted = spark.createDataFrame(rowRDD.rdd(), structType)
        .select(functions.from_json(functions.col("message"), structTypeMsg).as("data"))
        .select(functions.col("data.host.name").as("host"),
                functions.col("data.fields.type").as("fields"),
                functions.col("data.message").as("message"));
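Alternatively, when you only need a handful of nested fields, Spark's built-in `get_json_object` can extract them by JSONPath without defining any schema at all. A minimal self-contained sketch, not the exact pipeline above (the class name `NestedJsonDemo` and the local `SparkSession` are illustrative assumptions):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;

public class NestedJsonDemo {

    // Extract host.name, fields.type and message from one JSON payload
    // using JSONPath expressions; no StructType schema is required.
    public static Row extract(SparkSession spark, String json) {
        // A Dataset<String> exposes its content as a single column named "value".
        Dataset<String> raw = spark.createDataset(
                Collections.singletonList(json), Encoders.STRING());
        return raw.select(
                get_json_object(col("value"), "$.host.name").as("host"),
                get_json_object(col("value"), "$.fields.type").as("fields"),
                get_json_object(col("value"), "$.message").as("message"))
            .first();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]").appName("NestedJsonDemo").getOrCreate();
        String json = "{\"host\":{\"name\":\"test.com\"},"
                + "\"fields\":{\"type\":\"test\"},\"message\":\"testing\"}";
        System.out.println(extract(spark, json));
        spark.stop();
    }
}
```

The trade-off is that `get_json_object` returns every field as a string and parses the JSON once per call, so the explicit-schema approach with `from_json` is preferable when many fields are needed or when types matter.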
