I have created a PySpark application that reads a JSON file into a dataframe through a defined schema. Code sample below:
schema = StructType([
StructField("domain", StringType(), True),
StructField("timestamp", LongType(), True),
])
df = sqlContext.read.json(file, schema)
I need a way to define this schema in some kind of config or ini file and read it in the main PySpark application.
This would let me modify the schema for changing JSON, if the need arises in future, without changing the main PySpark code.
StructType provides json and jsonValue methods, which can be used to obtain JSON and dict representations respectively, and fromJson, which can be used to convert a Python dictionary to a StructType.
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

# Round-trip: dict representation back into a StructType
StructType.fromJson(schema.jsonValue())
The only thing you need beyond that is the built-in json module to parse the input into a dict that can be consumed by StructType.
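As a minimal stdlib-only sketch of that parsing step (reusing the question's field names; the string here stands in for the contents of a config file), json.loads produces exactly the dict shape that StructType.fromJson expects:

```python
import json

# Schema JSON as it might live in a config file.
schema_json = """
{
  "type": "struct",
  "fields": [
    {"name": "domain", "type": "string", "nullable": true, "metadata": {}},
    {"name": "timestamp", "type": "long", "nullable": true, "metadata": {}}
  ]
}
"""

# json.loads yields the dict consumed by StructType.fromJson, i.e.
#   custom_schema = StructType.fromJson(schema_dict)
schema_dict = json.loads(schema_json)
print([f["name"] for f in schema_dict["fields"]])  # → ['domain', 'timestamp']
```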
For the Scala version, see How to create a schema from CSV file and persist/save that schema to a file?
You can create a JSON file named schema.json in the format below:
{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}
Create a struct schema by reading this file:
import json

# wholeTextFiles returns (path, content) pairs; take the content of the file
rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
text = rdd.collect()[0][1]

# Avoid shadowing the built-in dict
schema_dict = json.loads(text)
custom_schema = StructType.fromJson(schema_dict)
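If the schema file lives on the local filesystem rather than S3, plain Python file I/O is enough; wholeTextFiles is only needed for distributed storage. A self-contained sketch (it writes the assumed schema.json first so the example runs on its own; the final StructType.fromJson step is unchanged):

```python
import json

# Write the schema file for this self-contained example; in practice it
# already exists as a config artifact alongside the application.
schema_dict = {
    "fields": [
        {"metadata": {}, "name": "first_fields", "nullable": True, "type": "string"},
        {"metadata": {}, "name": "double_field", "nullable": True, "type": "double"},
    ],
    "type": "struct",
}
with open("schema.json", "w") as f:
    json.dump(schema_dict, f)

# Read it back with ordinary file I/O -- no Spark needed for this step.
with open("schema.json") as f:
    loaded = json.load(f)

# With pyspark installed:
#   from pyspark.sql.types import StructType
#   custom_schema = StructType.fromJson(loaded)
print(loaded == schema_dict)  # → True
```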
After that, you can use the struct as a schema to read the JSON file:
df = spark.read.json("path", schema=custom_schema)