
Convert Spark DataFrame columns to a JSON object

I have a DataFrame with the following data:

+-----------+-------+-----+
|file_name  | key   |Value|
+-----------+-------+-----+
| file1     | key1  | 7   |
| file1     | key2  | 11  |
| file1     | key3  | 3   |
| file2     | key1  | 9   |
| file2     | key2  | 2   |
| file2     | key3  | 10  |
+-----------+-------+-----+

With the following code I have solved one step of my problem:

dataset.select(col("file_name"), to_json(struct(col("key").alias("key"),col("value").alias("value"))).alias("output"))
       .groupBy(col("file_name")).agg(collect_list(col("output")).alias("output"))
       .show(false);

This gives me output like this:

+-----------+-------------------------------------------------------------------------------------+
|file_name  | output                                                                              |
+-----------+-------------------------------------------------------------------------------------+
| file1     |[{"key":"key1","value":"7"}, {"key":"key2","value":"11"}, {"key":"key3","value":"3"}]|
| file2     |[{"key":"key1","value":"9"}, {"key":"key2","value":"2"}, {"key":"key3","value":"10"}]|
+-----------+-------------------------------------------------------------------------------------+

But I want my final output in the following JSON structure. Can you suggest any changes to get the output in this format (a JSON object holding a JSON array)?

+-----------+----------------------------------------------------------------------------------------------+
|file_name  | output                                                                                       |
+-----------+----------------------------------------------------------------------------------------------+
| file1     |{"result":[{"key":"key1","value":"7"},{"key":"key2","value":"11"},{"key":"key3","value":"3"}]}|
| file2     |{"result":[{"key":"key1","value":"9"},{"key":"key2","value":"2"},{"key":"key3","value":"10"}]}|
+-----------+----------------------------------------------------------------------------------------------+

Try adding another select statement: select(col("file_name"), to_json(struct(col("output").alias("result"))).alias("output"))

The code should be something like:


  dataset.select(col("file_name"), to_json(struct(col("key").alias("key"),col("value").alias("value"))).alias("output"))
       .groupBy(col("file_name")).agg(collect_list(col("output")).alias("output"))
       .select(col("file_name"), to_json(struct(col("output").alias("result"))).alias("output"))
       .show(false);

You can put the result inside a struct before calling to_json. Note that you shouldn't call to_json twice: the second call treats the already-serialized JSON as a plain string value, which results in doubly escaped quotes.

dataset.groupBy("file_name").agg(
    to_json(
        struct(
            collect_list(struct("key", "value")).alias("result")
        )
    ).alias("output")
).show(false)

+---------+----------------------------------------------------------------------------------------------+
|file_name|output                                                                                        |
+---------+----------------------------------------------------------------------------------------------+
|file2    |{"result":[{"key":"key1","value":"9"},{"key":"key2","value":"2"},{"key":"key3","value":"10"}]}|
|file1    |{"result":[{"key":"key1","value":"7"},{"key":"key2","value":"11"},{"key":"key3","value":"3"}]}|
+---------+----------------------------------------------------------------------------------------------+
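The double-escaping warning in this answer can be reproduced without Spark. The following is only a minimal plain-Java sketch: the class DoubleEncodingDemo and its helpers (quote, pairToJson, wrapSerializedTwice, wrapSerializedOnce) are hypothetical names invented for illustration and are not part of any Spark API. It hand-builds JSON to show what happens when text that is already JSON gets serialized a second time, compared with serializing structured data once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration (no Spark). All names here are hypothetical.
class DoubleEncodingDemo {

    // Quote a Java string as a JSON string literal, escaping backslashes and quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Serialize one key/value pair as a JSON object,
    // roughly what to_json(struct(key, value)) produces for string columns.
    static String pairToJson(String key, String value) {
        return "{\"key\":" + quote(key) + ",\"value\":" + quote(value) + "}";
    }

    // Serialized twice: each element is already JSON text, so serializing it
    // again treats it as a string value and escapes its quotes.
    static String wrapSerializedTwice(List<String> jsonObjects) {
        return "{\"result\":["
                + jsonObjects.stream()
                        .map(DoubleEncodingDemo::quote)
                        .collect(Collectors.joining(","))
                + "]}";
    }

    // Serialized once: the elements stay structured until the final step.
    static String wrapSerializedOnce(List<String[]> pairs) {
        return "{\"result\":["
                + pairs.stream()
                        .map(p -> pairToJson(p[0], p[1]))
                        .collect(Collectors.joining(","))
                + "]}";
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"key1", "7"},
                new String[] {"key2", "11"});
        List<String> alreadyJson = pairs.stream()
                .map(p -> pairToJson(p[0], p[1]))
                .collect(Collectors.toList());

        // The array elements come out as escaped strings, not objects.
        System.out.println(wrapSerializedTwice(alreadyJson));
        // The array elements come out as real JSON objects.
        System.out.println(wrapSerializedOnce(pairs));
    }
}
```

In Spark terms, the first method corresponds to collecting to_json strings and then wrapping them in another to_json, while the second corresponds to collecting structs and calling to_json only once at the end.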

