How to read a JSON file with a specific format with Spark Scala?
I'm trying to read a JSON file that looks like this:
[
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"5","up":null,"Crate":"2"}
,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"}
,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"}
]}
,{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"}
,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"}
,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"}
]}
,...
]
I've tried the command:
val df = sqlContext.read.json("namefile")
df.show()
But this doesn't work: my columns are not recognized...
If you want to use read.json you need a single JSON document per line. If your file contains a valid JSON array of documents, it simply won't work as expected. For example, taking your sample data, the input file should be formatted like this:
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"30","Nrout":"5","up":null,"Crate":"2"} ,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"} ,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"} ]}
{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"} ,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"} ,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"} ]}
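If reformatting the file by hand is impractical, one approach (a sketch, not part of the original answer) is to split the top-level array into one document per element before handing it to Spark. The helper below tracks brace depth and splits on top-level commas; it is deliberately naive and assumes braces never appear inside string values, which holds for this data:

```scala
// Naive sketch: split a top-level JSON array into one document per element.
// Assumes '{' and '}' never occur inside string values.
def jsonArrayToLines(s: String): Seq[String] = {
  val body = s.trim.stripPrefix("[").stripSuffix("]")
  val out = scala.collection.mutable.Buffer.empty[String]
  val cur = new StringBuilder
  var depth = 0
  for (c <- body) {
    if (c == '{') depth += 1
    if (c == '}') depth -= 1
    if (c == ',' && depth == 0) { out += cur.toString.trim; cur.clear() }
    else cur += c
  }
  if (cur.nonEmpty) out += cur.toString.trim
  out.toSeq.filter(_.nonEmpty)
}
```

Each returned string is a complete JSON document, so the result can be fed straight to Spark, e.g. sqlContext.read.json(sc.parallelize(jsonArrayToLines(raw))), or written back out one document per line.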
If you use read.json on the structure above, you'll see it is parsed as expected:
scala> sqlContext.read.json("namefile").printSchema
root
|-- COL: long (nullable = true)
|-- DATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Crate: string (nullable = true)
| | |-- MLrate: string (nullable = true)
| | |-- Nrout: string (nullable = true)
| | |-- up: string (nullable = true)
|-- IFAM: string (nullable = true)
|-- KTM: long (nullable = true)
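With that schema in place, the nested DATA array can be flattened into one row per element with explode. A sketch (column names taken from the schema above; df is assumed to be the DataFrame returned by sqlContext.read.json("namefile")):

```scala
// Sketch: flatten DATA so each array element becomes its own row
import org.apache.spark.sql.functions.explode

val flat = df
  .select(df("COL"), df("KTM"), explode(df("DATA")).as("d"))
  .select("COL", "KTM", "d.MLrate", "d.Nrout", "d.Crate")
flat.show()
```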
If you don't want to reformat your JSON file (line by line), you can instead build a schema yourself with StructType (and MapType, for map-like fields) from the Spark SQL types:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Convenience function for turning JSON strings into DataFrames
def jsonToDataFrame(json: String, schema: StructType = null): DataFrame = {
  val reader = spark.read  // on Spark 1.x, use sqlContext.read instead
  Option(schema).foreach(reader.schema)
  reader.json(sc.parallelize(Array(json)))
}
// Using a struct
val schema = new StructType().add("a", new StructType().add("b", IntegerType))
// call the function passing the sample JSON data and the schema as parameter
val json_df = jsonToDataFrame("""
{
"a": {
"b": 1
}
} """, schema)
// now you can access your json fields
val b_value = json_df.select("a.b")
b_value.show()
See this reference documentation for more examples and details: https://docs.databricks.com/spark/latest/spark-sql/complex-types.html#transform-complex-data-types-scala