I have JSON files describing a table structure. I want to read each file from S3 as a single String so that I can then apply the fromJson() method of org.apache.spark.sql.types.DataType:
DataType.fromJson(jsonString).asInstanceOf[StructType]
So far I have only managed to read the files into a DataFrame:
val testJsonData = sqlContext.read.option("multiline", "true").json("/s3Bucket/metrics/metric1.json")
But I don't need the inferred df.schema; I need to parse the file's contents (a JSON string) into a StructType.
The contents of a JSON file:
{
"type" : "struct",
"fields" : [ {
"name" : "metric_name",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "metric_time",
"type" : "long",
"nullable" : true,
"metadata" : { }
}, {
"name" : "metric_value",
"type" : "string",
"nullable" : true,
"metadata" : { }
}]
}
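For context, the round trip I'm after is just this (a minimal local sketch, with the JSON above trimmed to one field and pasted into a String literal):
import org.apache.spark.sql.types.{DataType, StructType}

// Parse a schema definition (Spark's own schema JSON format) into a StructType.
val json = """{"type":"struct","fields":[{"name":"metric_name","type":"string","nullable":true,"metadata":{}}]}"""
val schema = DataType.fromJson(json).asInstanceOf[StructType]
// schema: StructType = StructType(StructField(metric_name,StringType,true))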
It looks like what you want to use is sc.wholeTextFiles (where sc is a SparkContext). This gives you an RDD[(String, String)], in which ._1 is the file path and ._2 is the entire file content. Maybe you can try:
import org.apache.spark.sql.types.{DataType, StructType}

val files = sc.wholeTextFiles("/s3Bucket/metrics/", 16)
val schemas = files.map { case (_, json) => DataType.fromJson(json).asInstanceOf[StructType] }
which gives you an RDD[StructType]. (Note that Spark has no built-in Encoder for StructType, so turning this into a Dataset[StructType] would also require registering one, e.g. implicit val enc = Encoders.kryo[StructType], before calling .toDS() and mapping.) Unfortunately, I'm not finding a similar whole-file reader in the pure Spark SQL API, but this should work.
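Once you have the schemas, you can apply one when reading actual data. A minimal usage sketch (the data path and the spark session name are assumptions, not from the question):
// Collect one parsed schema to the driver and use it to read metric data.
val schema: StructType = schemas.first()
val metrics = spark.read.schema(schema).json("/s3Bucket/data/metrics/")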
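If you only have one file and want to avoid RDDs entirely, reading the object through the Hadoop FileSystem API is another option; a minimal sketch, assuming sc is in scope and reusing the path from the question:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.types.{DataType, StructType}

// Open the S3 object with whatever FileSystem implementation backs the path,
// read it fully into one String, then parse it as a schema.
val path = new Path("/s3Bucket/metrics/metric1.json")
val fs = path.getFileSystem(sc.hadoopConfiguration)
val in = fs.open(path)
val json = try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
val schema = DataType.fromJson(json).asInstanceOf[StructType]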