
Not able to read json files: Spark Structured Streaming using java

I have a Python script that fetches stock data (as below) from the NYSE every minute and writes it to a new file (a single line per file). Each file contains data for 4 stocks - MSFT, ADBE, GOOGL, and FB - in the JSON format below:

[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]

I'm trying to read this file stream into a Spark Structured Streaming DataFrame, but I'm not able to define the proper schema for it. I looked around the internet and have done the following so far:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;



public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {


        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);


        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);


        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();        

    }

}

The output I'm getting is this-

root
 |-- symbol: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- stock: struct (nullable = true)
 |    |-- open: double (nullable = true)
 |    |-- high: double (nullable = true)
 |    |-- low: double (nullable = true)
 |    |-- close: double (nullable = true)
 |    |-- volume: long (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol|          timestamp|stock|
+------+-------------------+-----+
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+

I have even tried reading the JSON string as a text file first and then applying the schema (as is done with Kafka streaming)...

    Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
    Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"), schema));
    raw2.writeStream().format("console").start().awaitTermination();

In this case I get the output below; the rawData DataFrame holds the JSON data as plain strings:

+--------------------+
|jsontostructs(value)|
+--------------------+
|                null|
|                null|
|                null|
|                null|
|                null|

Please help me figure it out.

Just figured it out. Keep the following two things in mind:

  1. While defining the schema, make sure you name and order the fields exactly as they appear in your JSON file.

  2. Initially, use only StringType for all your fields; you can apply a transformation afterwards to cast them to a specific data type.

This is what worked for me-

    StructType priceData = new StructType()
            .add("open", DataTypes.StringType)
            .add("high", DataTypes.StringType)
            .add("low", DataTypes.StringType)
            .add("close", DataTypes.StringType)
            .add("volume", DataTypes.StringType);

    StructType schema = new StructType()
            .add("symbol", DataTypes.StringType)
            .add("timestamp", DataTypes.StringType)
            .add("priceData", priceData);


    Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
    rawData.writeStream().format("console").start().awaitTermination();
    session.close();

See the output-

+------+-------------------+--------------------+
|symbol|          timestamp|           priceData|
+------+-------------------+--------------------+
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+

You can now flatten the priceData column using priceData.open, priceData.close, etc.
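That flattening step, combined with casting the string fields back to numeric types, can be sketched as follows. This is a minimal sketch against the rawData DataFrame from the code above; the double/long target types are an assumption carried over from the original schema attempt:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Flatten the nested priceData struct via dotted column paths and
// cast the string fields to numeric types (double/long are assumed targets).
Dataset<Row> flat = rawData.select(
        col("symbol"),
        col("timestamp"),
        col("priceData.open").cast("double").alias("open"),
        col("priceData.high").cast("double").alias("high"),
        col("priceData.low").cast("double").alias("low"),
        col("priceData.close").cast("double").alias("close"),
        col("priceData.volume").cast("long").alias("volume"));
```

Since this is a plain projection, it works the same on a streaming DataFrame as on a static one, so it can sit between readStream and writeStream unchanged.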
