Convert Spark DataFrame columns to a JSON object

I have a DataFrame with the following data:

+-----------+-------+-----+
|file_name  | key   |Value|
+-----------+-------+-----+
| file1     | key1  | 7   |
| file1     | key2  | 11  |
| file1     | key3  | 3   |
| file2     | key1  | 9   |
| file2     | key2  | 2   |
| file2     | key3  | 10  |
+-----------+-------+-----+

With the following code I have solved one step of my problem:

dataset.select(col("file_name"), to_json(struct(col("key").alias("key"),col("value").alias("value"))).alias("output"))
       .groupBy(col("file_name")).agg(collect_list(col("output")).alias("output"))
       .show(false);

This gives me output like this:

+-----------+-------------------------------------------------------------------------------------+
|file_name  | output                                                                              |
+-----------+-------------------------------------------------------------------------------------+
| file1     |[{"key":"key1","value":"7"}, {"key":"key2","value":"11"}, {"key":"key3","value":"3"}]|
| file2     |[{"key":"key1","value":"9"}, {"key":"key2","value":"2"}, {"key":"key3","value":"10"}]|
+-----------+-------------------------------------------------------------------------------------+

But I want my final output in the following JSON structure. Can you suggest any changes to get the output in this format (a JSON object holding a JSON array)?

+-----------+----------------------------------------------------------------------------------------------+
|file_name  | output                                                                                       |
+-----------+----------------------------------------------------------------------------------------------+
| file1     |{"result":[{"key":"key1","value":"7"},{"key":"key2","value":"11"},{"key":"key3","value":"3"}]}|
| file2     |{"result":[{"key":"key1","value":"9"},{"key":"key2","value":"2"},{"key":"key3","value":"10"}]}|
+-----------+----------------------------------------------------------------------------------------------+

Try adding another select statement: select(col("file_name"), to_json(struct(col("output").alias("result"))).alias("output"))

The code should look something like this:

dataset.select(col("file_name"), to_json(struct(col("key").alias("key"),col("value").alias("value"))).alias("output"))
       .groupBy(col("file_name")).agg(collect_list(col("output")).alias("output"))
       .select(col("file_name"), to_json(struct(col("output").alias("result"))).alias("output"))
       .show(false);

You can put the result inside a struct before calling to_json. Note that you shouldn't call to_json twice, because the second call treats the already-serialized JSON as plain strings and escapes their quotes.

dataset.groupBy("file_name").agg(
    to_json(
        struct(
            collect_list(struct("key", "value")).alias("result")
        )
    ).alias("output")
).show(false)

+---------+----------------------------------------------------------------------------------------------+
|file_name|output                                                                                        |
+---------+----------------------------------------------------------------------------------------------+
|file2    |{"result":[{"key":"key1","value":"9"},{"key":"key2","value":"2"},{"key":"key3","value":"10"}]}|
|file1    |{"result":[{"key":"key1","value":"7"},{"key":"key2","value":"11"},{"key":"key3","value":"3"}]}|
+---------+----------------------------------------------------------------------------------------------+
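
To make the double-escaping point concrete, here is a minimal sketch in plain Java (no Spark involved) that mimics what happens when already-serialized JSON strings are serialized a second time. The class and method names (DoubleEncodingDemo, jsonString, wrapAsObjects, wrapAsStrings) are hypothetical helpers written for this illustration, and jsonString only escapes backslashes and double quotes, so it is not a complete JSON encoder.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DoubleEncodingDemo {

    // Encode a Java string as a JSON string literal.
    // Simplified on purpose: only backslashes and double quotes are escaped.
    public static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Treat each element as an already-formed JSON object and embed it as-is.
    // This corresponds to serializing nested structs once.
    public static String wrapAsObjects(List<String> jsonObjects) {
        return "{\"result\":[" + String.join(",", jsonObjects) + "]}";
    }

    // Treat each element as a plain string and encode it again.
    // This corresponds to serializing a list of JSON strings a second time.
    public static String wrapAsStrings(List<String> jsonObjects) {
        String items = jsonObjects.stream()
                .map(DoubleEncodingDemo::jsonString)
                .collect(Collectors.joining(","));
        return "{\"result\":[" + items + "]}";
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "{\"key\":\"key1\",\"value\":\"7\"}",
                "{\"key\":\"key2\",\"value\":\"11\"}");

        // Single serialization: the array holds JSON objects.
        System.out.println(wrapAsObjects(rows));

        // Double serialization: the array holds strings with escaped quotes.
        System.out.println(wrapAsStrings(rows));
    }
}
```

The first printed line has the shape requested in the question, while the second contains an array of strings with backslash-escaped quotes. That is why collecting structs and serializing once, as in the groupBy/agg snippet above, is the safer approach.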
