Unable to read JSON file: Spark Structured Streaming with Java

Question

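I have a Python script that fetches stock data from the NYSE every minute and writes it to a new single-line file (sample below). Each file contains data for four stocks, MSFT, ADBE, GOOGL and FB, in the following JSON format: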

[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]

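I am trying to read this file stream into a Spark Structured Streaming DataFrame, but I cannot define a proper schema for it. After searching online, this is what I have so far: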

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;



public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {


        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);


        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);


        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();        

    }

}

The output I get is:

root
 |-- symbol: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- stock: struct (nullable = true)
 |    |-- open: double (nullable = true)
 |    |-- high: double (nullable = true)
 |    |-- low: double (nullable = true)
 |    |-- close: double (nullable = true)
 |    |-- volume: long (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol|          timestamp|stock|
+------+-------------------+-----+
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+

I even tried reading the JSON as a text file first and then applying the schema (as is done with Kafka streaming):

    Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
    Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"), schema));
    raw2.writeStream().format("console").start().awaitTermination();

That gives the output below. In this case the rawData DataFrame holds the JSON data in string format:

+--------------------+
|jsontostructs(value)|
+--------------------+
|                null|
|                null|
|                null|
|                null|
|                null|

Please help me figure this out.

Answer 1

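Just figured it out. Keep the following two points in mind:

When defining the schema, make sure the field names and order match the JSON file exactly. In the question, the nested struct was named stock, but the key in the file is priceData.
Start with StringType for every field, since the numeric values in the file are quoted strings (for example "open": "126.0800"); you can then apply a transformation to cast them to specific data types.

This is what worked for me: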

    StructType priceData = new StructType()
            .add("open", DataTypes.StringType)
            .add("high", DataTypes.StringType)
            .add("low", DataTypes.StringType)
            .add("close", DataTypes.StringType)
            .add("volume", DataTypes.StringType);

    StructType schema = new StructType()
            .add("symbol", DataTypes.StringType)
            .add("timestamp", DataTypes.StringType)
            .add("priceData", priceData);


    Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
    rawData.writeStream().format("console").start().awaitTermination();
    session.close();

The output:

+------+-------------------+--------------------+
|symbol|          timestamp|           priceData|
+------+-------------------+--------------------+
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+

You can now flatten the priceData column by selecting priceData.open, priceData.close, and so on.
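The flattening and casting step can be sketched as a list of Spark SQL expressions, one per nested field. The class PriceDataExprs and its methods are hypothetical helpers, not part of Spark or of the original answer, and the target types (DOUBLE, BIGINT) are an assumption about what the data should become. The block itself is plain Java, so it runs without Spark on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PriceDataExprs {

    // Hypothetical helper: builds SQL expressions that pull each nested field
    // out of a struct column and cast it from string to a target type.
    public static String[] flattenExprs(String structCol,
                                        Map<String, String> fieldTypes,
                                        String... passThrough) {
        List<String> exprs = new ArrayList<>();
        // Columns kept unchanged; backticks guard names such as timestamp.
        for (String col : passThrough) {
            exprs.add("`" + col + "`");
        }
        // One "CAST(struct.field AS TYPE) AS field" expression per nested field.
        for (Map.Entry<String, String> e : fieldTypes.entrySet()) {
            exprs.add("CAST(`" + structCol + "`.`" + e.getKey() + "` AS "
                    + e.getValue() + ") AS `" + e.getKey() + "`");
        }
        return exprs.toArray(new String[0]);
    }

    // Field names come from the JSON in the question; the types are assumed.
    public static String[] stockExprs() {
        Map<String, String> types = new LinkedHashMap<>();
        types.put("open", "DOUBLE");
        types.put("high", "DOUBLE");
        types.put("low", "DOUBLE");
        types.put("close", "DOUBLE");
        types.put("volume", "BIGINT");
        return flattenExprs("priceData", types, "symbol", "timestamp");
    }

    public static void main(String[] args) {
        // Print the expressions so they can be inspected before use.
        for (String expr : stockExprs()) {
            System.out.println(expr);
        }
    }
}
```

With the all-string schema above, the resulting array could then be passed to Dataset.selectExpr, for example rawData.selectExpr(PriceDataExprs.stockExprs()), before calling writeStream(). Building the expressions as strings keeps the field list in one place; chaining col("priceData.open").cast("double") calls would be an equivalent alternative.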
