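Not able to read JSON files: Spark Structured Streaming using Java
I have a Python script that fetches stock data from the NYSE every minute and writes it to a new single-line file (sample below). Each file contains data for four stocks, MSFT, ADBE, GOOGL and FB, in the following JSON format: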
[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]
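I am trying to read this stream of files into a Spark Structured Streaming DataFrame, but I cannot define a proper schema for it. After searching the internet, this is what I have so far: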
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
public class Driver1 {
    public static void main(String args[]) throws InterruptedException, StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);
        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);
        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);
        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();
    }
}
The output I get is:
root
 |-- symbol: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- stock: struct (nullable = true)
 |    |-- open: double (nullable = true)
 |    |-- high: double (nullable = true)
 |    |-- low: double (nullable = true)
 |    |-- close: double (nullable = true)
 |    |-- volume: long (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol|          timestamp|stock|
+------+-------------------+-----+
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
|  MSFT|2019-05-02 15:59:00| null|
|  ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
|    FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+
I even tried reading the JSON string as a text file first and then applying the schema (as is done with Kafka streaming):
Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"),schema));
raw2.writeStream().format("console").start().awaitTermination();
This gives the output below. In this case the rawData DataFrame holds the JSON data in string format:
+--------------------+
|jsontostructs(value)|
+--------------------+
|                null|
|                null|
|                null|
|                null|
|                null|
Please help me figure this out.
I just figured it out. Keep the following two points in mind:
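When defining the schema, make sure the fields are named and ordered exactly as in the JSON file. In the schema from the question the nested struct was called stock, while the JSON key is priceData, so that column came back null.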
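Start by using StringType for every field (the numeric values in this file are quoted, so they are JSON strings); you can then apply transformations to convert them to more specific data types.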
This is what worked for me:
StructType priceData = new StructType()
        .add("open", DataTypes.StringType)
        .add("high", DataTypes.StringType)
        .add("low", DataTypes.StringType)
        .add("close", DataTypes.StringType)
        .add("volume", DataTypes.StringType);
StructType schema = new StructType()
        .add("symbol", DataTypes.StringType)
        .add("timestamp", DataTypes.StringType)
        .add("priceData", priceData);
Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
rawData.writeStream().format("console").start().awaitTermination();
session.close();
The output:
+------+-------------------+--------------------+
|symbol|          timestamp|           priceData|
+------+-------------------+--------------------+
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
|  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
|  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
|    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+
Now you can flatten the priceData column by selecting priceData.open, priceData.close, and so on.
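The flattening and type conversion might look roughly like the sketch below. It assumes rawData was read with the all-StringType schema shown earlier; the class name FlattenPriceData, the method name flatten and the output column names are hypothetical and not part of the original post. It relies only on Dataset.select, functions.col and Column.cast, which behave the same on streaming and batch DataFrames.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical helper; the class and method names are illustrative only.
final class FlattenPriceData {

    private FlattenPriceData() {}

    /**
     * Expects the columns symbol, timestamp and priceData (a struct of strings)
     * and returns one flat row per record with typed price columns.
     */
    static Dataset<Row> flatten(Dataset<Row> rawData) {
        return rawData.select(
                col("symbol"),
                // Strings such as "2019-05-02 15:59:00" are cast to TimestampType.
                col("timestamp").cast("timestamp").alias("timestamp"),
                // A dotted name selects a field inside the struct column.
                col("priceData.open").cast("double").alias("open"),
                col("priceData.high").cast("double").alias("high"),
                col("priceData.low").cast("double").alias("low"),
                col("priceData.close").cast("double").alias("close"),
                col("priceData.volume").cast("long").alias("volume"));
    }
}
```

In the streaming job this would replace the direct write of rawData, for example FlattenPriceData.flatten(rawData).writeStream().format("console").start().awaitTermination(). A value that cannot be cast comes back as null rather than failing the query, so it may be worth checking for unexpected nulls after the conversion.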