Converting a pyspark dataframe to a nested json object

I have a Spark dataframe as follows:

+---------+--------------+------------------+
| item_id | popular_tags | popularity_score |
+---------+--------------+------------------+
| id_1    | Samsung      | 0.4              |
| id_1    | long battery | 0.8              |
| id_2    | Apple        | 0.9              |
| id_2    | UI           | 0.9              |
+---------+--------------+------------------+

I want to group this dataframe by item_id and write it to a file in which each line is a JSON object:

{"id_1": {"Samsung": {"popularity_score": 0.4}, "long battery": {"popularity_score": 0.8}}}
{"id_2": {"Apple": {"popularity_score": 0.9}, "UI": {"popularity_score": 0.9}}}

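I tried the to_json and collect_list functions, but what I get is a list, not a nested JSON object. This is a large distributed dataframe, so converting it to pandas or collecting it onto a single machine is not an option.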

You need to create some map types for your JSON:

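import pyspark.sql.functions as F

# Build one map per item_id: popular_tags -> struct(popularity_score).
df2 = df.groupBy('item_id').agg(
    F.map_from_entries(
        F.collect_list(
            # Each entry is a (key, value) struct; the value is itself a struct
            # so that it serializes as {"popularity_score": ...}.
            F.struct('popular_tags', F.struct('popularity_score'))
        )
    ).alias('m')
).select(
    # Wrap the map under its item_id and serialize the row to a JSON string.
    F.to_json(
        F.create_map('item_id', 'm')
    ).alias('col')
)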

df2.show(truncate=False)
+-------------------------------------------------------------------------------------+
|col                                                                                  |
+-------------------------------------------------------------------------------------+
|{"id_2":{"Apple":{"popularity_score":0.9},"UI":{"popularity_score":0.9}}}            |
|{"id_1":{"Samsung":{"popularity_score":0.4},"long battery":{"popularity_score":0.8}}}|
+-------------------------------------------------------------------------------------+
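
To make the target shape concrete, here is a minimal plain-Python sketch of the same nesting on the four sample rows. It is only a local model for checking expected output on a small sample (for example in a unit test), not a replacement for the Spark job; the `rows` list and the helpers `nest_rows` and `to_json_lines` are hypothetical names introduced for illustration. Note that a repeated tag within one item simply overwrites the earlier entry here, whereas how Spark treats duplicate map keys depends on its version and configuration.

```python
import json
from collections import defaultdict

# Sample rows from the question: (item_id, popular_tags, popularity_score).
rows = [
    ("id_1", "Samsung", 0.4),
    ("id_1", "long battery", 0.8),
    ("id_2", "Apple", 0.9),
    ("id_2", "UI", 0.9),
]


def nest_rows(rows):
    """Group rows into {item_id: {tag: {"popularity_score": score}}}.

    Hypothetical helper that models what groupBy + map_from_entries
    build in the Spark version; it runs on a single machine only.
    """
    grouped = defaultdict(dict)
    for item_id, tag, score in rows:
        # Counterpart of one (key, value) entry passed to map_from_entries.
        grouped[item_id][tag] = {"popularity_score": score}
    return dict(grouped)


def to_json_lines(nested):
    """Serialize each item as its own single-key JSON object, one per line."""
    return [
        json.dumps({item_id: tags}, separators=(",", ":"))
        for item_id, tags in nested.items()
    ]


json_lines = to_json_lines(nest_rows(rows))
for line in json_lines:
    print(line)
```

Each printed line is a standalone JSON object keyed by item_id, which is the per-line layout asked for in the question.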

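If map_from_entries is not available (it was added in Spark 2.4), you may have to rely on a dirty trick: collect a list of single-entry maps, serialize it, and then use regular expressions to strip the array brackets and merge the adjacent objects: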

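df2 = df.groupBy('item_id').agg(
    # Collect one single-entry map per row: [{tag: {popularity_score}}, ...].
    F.collect_list(
        F.create_map('popular_tags', F.struct('popularity_score'))
    ).alias('m')
).select(
    F.regexp_replace(
        F.regexp_replace(
            F.to_json(F.create_map('item_id', 'm')),
            '(\\[|\\])',  # strip the array brackets
            ''
        ),
        '\\},\\{',  # merge the adjacent single-entry objects
        ','
    ).alias('col')
)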

df2.show(truncate=False)
+-------------------------------------------------------------------------------------+
|col                                                                                  |
+-------------------------------------------------------------------------------------+
|{"id_2":{"Apple":{"popularity_score":0.9},"UI":{"popularity_score":0.9}}}            |
|{"id_1":{"Samsung":{"popularity_score":0.4},"long battery":{"popularity_score":0.8}}}|
+-------------------------------------------------------------------------------------+
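
The two regexp_replace calls are easier to follow on a single string. The sketch below replays them with Python's re module on the text that to_json would be expected to produce for an array of single-entry maps. The intermediate string is written out by hand as an assumption, and `flatten_map_array` and the tag "UI [beta]" are hypothetical, introduced only for illustration. It also shows the main weakness of the trick: the regex cannot tell structural brackets from brackets inside a tag.

```python
import json
import re


def flatten_map_array(json_text):
    """Replay the two regexp_replace steps of the workaround (hypothetical helper)."""
    # Step 1: drop the array brackets around the list of single-entry maps.
    no_brackets = re.sub(r"(\[|\])", "", json_text)
    # Step 2: merge neighbouring objects by turning "},{" into ",".
    return re.sub(r"\},\{", ",", no_brackets)


# Assumed intermediate JSON for id_1: an array of single-entry maps.
intermediate = (
    '{"id_1":[{"Samsung":{"popularity_score":0.4}},'
    '{"long battery":{"popularity_score":0.8}}]}'
)
flattened = flatten_map_array(intermediate)
print(flattened)

# The flattened text parses as one nested object.
parsed = json.loads(flattened)

# Caveat: brackets that belong to the data are stripped as well.
tricky = '{"id_2":[{"UI [beta]":{"popularity_score":0.9}}]}'
damaged = flatten_map_array(tricky)
print(damaged)
```

If tags can contain `[`, `]` or the sequence `},{`, the regex approach silently alters them, so the map_from_entries version is the safer choice whenever it is available.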

