Pyspark get Schema from JSON file

I am trying to load a PySpark schema from a JSON file. When I define the schema as a variable in the Python code, its type is <class 'pyspark.sql.types.StructType'>, but when I read it from the JSON file, its type is unicode.

Is there any way to get a PySpark schema from a JSON file?

JSON file Content:

{
"tediasessionclose_schema" : "StructType([ StructField('@timestamp', StringType()), StructField('message' , StructType([ StructField('componentAddress', StringType()), StructField('values', StructType([ StructField('confNum', StringType()), StructField('day', IntegerType())]))]))])"
}

Pyspark Code:

df = sc.read.json(hdfs_path, schema = jsonfile['tediasessionclose_schema'])
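The unicode type comes from the JSON parser rather than from Spark: a quoted JSON value is always loaded as text, whatever it contains. A minimal sketch of that behaviour, using a shortened, hypothetical stand-in for the file above (the `raw` string is an assumption, not the real file):

```python
import json

# Hypothetical stand-in for the JSON file from the question (schema shortened).
raw = '{"tediasessionclose_schema": "StructType([StructField(\'day\', IntegerType())])"}'

jsonfile = json.loads(raw)
value = jsonfile['tediasessionclose_schema']

# The value is plain text, not a StructType:
# Python 2 reports <type 'unicode'>, Python 3 reports <class 'str'>.
print(type(value))
```

So the string still has to be turned into a StructType before it can be passed as schema.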

You can obtain the schema by evaluating the string that you get from reading the JSON file (eval executes whatever code the string contains, so only use it on files you trust):

import json
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

with open('test.json') as f:
    data = json.load(f)

df = sqlContext.createDataFrame([], schema = eval(data['tediasessionclose_schema']))
print(df.schema)

outputs:

StructType(List(StructField(@timestamp,StringType,true),StructField(message,StructType(List(StructField(componentAddress,StringType,true),StructField(values,StructType(List(StructField(confNum,StringType,true),StructField(day,IntegerType,true))),true))),true)))

where test.json is:

{"tediasessionclose_schema" : "StructType([ StructField('@timestamp', StringType()), StructField('message' , StructType([ StructField('componentAddress', StringType()), StructField('values', StructType([ StructField('confNum', StringType()), StructField('day', IntegerType())]))]))])"}

Hope this helps!

config_json file:

{"json_data_schema": ["contactId", "firstName", "lastName"]}

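PySpark application (the schema we want to end up with, written out by hand):

from pyspark.sql.types import StructType, StringType
schema = StructType().add("contactId", StringType()).add("firstName", StringType()).add("lastName", StringType())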

Reference: https://www.python-course.eu/lambda.php

# data is the parsed config_json file, e.g. data = json.load(f)
from functools import reduce
schema = reduce(lambda s, name: s.add(name, StringType(), True), data["json_data_schema"], StructType())

Hope this solution works for you!
