
Extract Schema from nested Json-String column in Pyspark

Assuming I have the following table:

body
{"Day":1,"vals":[{"id":"1", "val":"3"}, {"id":"2", "val":"4"}]}

My goal is to write down the schema in PySpark for this nested JSON column. I've tried the following two things:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  StructType([
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
  ])
  )
])

Here I get the following error:

'StructType' object has no attribute 'name'

Another approach was to declare the nested Arrays as ArrayType:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  ArrayType(
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    , True)
  )
])

Here I get the following error:

takes from 2 to 3 positional arguments but 5 were given

This probably comes from ArrayType only taking a single SQL type (plus an optional nullability flag) as its arguments.

Can anybody tell me how they would create the schema? I'm very new to this topic.

Each nested StructType has to be wrapped in a named StructField, so your second nested StructType needs a name of its own:

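from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Every nested StructType sits inside its own named StructField
schema = StructType([
    StructField("Day", DoubleType()),
    StructField("vals", StructType([StructField("id", StringType()), StructField("val", DoubleType())])),
    StructField("vals2", StructType([StructField("id", StringType()), StructField("val", DoubleType())])),
])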

