
How to convert a dataframe into a JSON file assigning a specific schema using pyspark?

I am using pyspark and I want to convert a Spark dataframe into a JSON file with a specific structure. The dataframe looks like this:

| Key  | desc | value |
|:---- |:----:| -----:|
| 12345| type | AA    |
| 12345| id   | q1w2e3|
| 98765| type | BB    |
| 98765| id   | z1x2c3|

I need to convert it into JSON like this:

{
  "12345": {
     "type":"AA",
     "id":"q1w2e3"
    },
  "98765":{
     "type":"BB",
     "id":"z1x2c3"
    }
}

Any idea? Thank you

First collect the dataframe:

Output = df.collect()

If you print `Output` you will get a list of Row objects, something like this:

[Row(Key='12345', desc='type', value='AA'), …]

Now iterate over this list with a for loop and build a nested dictionary, grouping each row's desc/value pair under its Key. Row fields can be accessed by name:

result = {}
for row in Output:
    result.setdefault(row["Key"], {})[row["desc"]] = row["value"]

Once the dictionary is built you can serialize it with json.dumps(result).
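The steps above can be sketched end-to-end. Since a live Spark session isn't assumed here, the `rows` list below stands in for what `df.collect()` would return (one entry per row, with fields Key, desc, value as in the example dataframe):

```python
import json
from collections import defaultdict

# Stand-in for the list of Row objects returned by df.collect();
# with pyspark you would instead do: rows = df.collect()
rows = [
    ("12345", "type", "AA"),
    ("12345", "id", "q1w2e3"),
    ("98765", "type", "BB"),
    ("98765", "id", "z1x2c3"),
]

# Group each (desc, value) pair under its Key.
result = defaultdict(dict)
for key, desc, value in rows:
    result[key][desc] = value

# Serialize the nested dictionary to the desired JSON shape.
output_json = json.dumps(result, indent=2)
print(output_json)
```

Note that `collect()` pulls the entire dataframe onto the driver, so this approach only suits dataframes small enough to fit in driver memory.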
