Converting a PySpark DataFrame to a nested JSON object
I have a Spark DataFrame like this:
+-------+------------+----------------+
|item_id|popular_tags|popularity_score|
+-------+------------+----------------+
|id_1   |Samsung     |0.4             |
|id_1   |long battery|0.8             |
|id_2   |Apple       |0.9             |
|id_2   |UI          |0.9             |
+-------+------------+----------------+
I want to group this DataFrame by item_id and write it out as a file in which each line is a JSON object:
{"id_1": {"Samsung": {"popularity_score": 0.4}, "long battery": {"popularity_score": 0.8}}}
{"id_2": {"Apple": {"popularity_score": 0.9}, "UI": {"popularity_score": 0.9}}}
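I tried using the to_json and collect_list functions, but what I get is a list, not a nested JSON object. This is a large distributed DataFrame, so I cannot convert it to pandas or collect it onto a single machine.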
You need to build some map types for your JSON:
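import pyspark.sql.functions as F

df2 = df.groupBy('item_id').agg(
    # Build a map of tag -> struct(popularity_score) for each item_id.
    F.map_from_entries(
        F.collect_list(
            F.struct('popular_tags', F.struct('popularity_score'))
        )
    ).alias('m')
).select(
    # Nest the map under its item_id key and serialize it to a JSON string.
    F.to_json(
        F.create_map('item_id', 'm')
    ).alias('col')
)

df2.show(truncate=False)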
+-------------------------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------------------------+
|{"id_2":{"Apple":{"popularity_score":0.9},"UI":{"popularity_score":0.9}}} |
|{"id_1":{"Samsung":{"popularity_score":0.4},"long battery":{"popularity_score":0.8}}}|
+-------------------------------------------------------------------------------------+
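To make the shape of the result easier to see, the same grouping can be modeled in plain Python on the four sample rows. This is only an illustrative sketch of the nested structure, not a replacement for the Spark job, and the helper name nest_rows is hypothetical.

```python
import json
from collections import defaultdict

# Sample rows from the question: (item_id, popular_tags, popularity_score).
rows = [
    ("id_1", "Samsung", 0.4),
    ("id_1", "long battery", 0.8),
    ("id_2", "Apple", 0.9),
    ("id_2", "UI", 0.9),
]


def nest_rows(rows):
    """Group rows into one JSON string per item_id (hypothetical helper)."""
    grouped = defaultdict(dict)
    for item_id, tag, score in rows:
        # Each (tag, struct) pair becomes one entry of the per-item map.
        grouped[item_id][tag] = {"popularity_score": score}
    # Nest each map under its item_id key and serialize compactly.
    return [
        json.dumps({item_id: tags}, separators=(",", ":"))
        for item_id, tags in grouped.items()
    ]


lines = nest_rows(rows)
for line in lines:
    print(line)
```

In Spark itself, since df2 has a single string column, it should be possible to write one JSON object per line with something like df2.write.text('/path/to/output'), where the path is a placeholder.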
If map_from_entries is not available (it was added in Spark 2.4), you may have to fall back on a dirty trick:
df2 = df.groupBy('item_id').agg(
    # Collect one single-entry map per row: [{tag: struct}, {tag: struct}, ...].
    F.collect_list(
        F.create_map('popular_tags', F.struct('popularity_score'))
    ).alias('m')
).select(
    F.regexp_replace(
        F.regexp_replace(
            F.to_json(F.create_map('item_id', 'm')),
            '(\\[|\\])',   # strip the array brackets
            ''
        ),
        '\\},\\{',         # merge adjacent objects into one
        ','
    ).alias('col')
)

df2.show(truncate=False)
+-------------------------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------------------------+
|{"id_2":{"Apple":{"popularity_score":0.9},"UI":{"popularity_score":0.9}}} |
|{"id_1":{"Samsung":{"popularity_score":0.4},"long battery":{"popularity_score":0.8}}}|
+-------------------------------------------------------------------------------------+
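The two regexp_replace calls can be traced on a single string in plain Python. The sketch below assumes the intermediate to_json output is an array of single-entry maps, which is what collect_list over create_map should serialize to; the variable names are illustrative.

```python
import json
import re

# Assumed intermediate to_json output for id_1: the collected list is
# serialized as a JSON array of single-entry maps.
raw = (
    '{"id_1":[{"Samsung":{"popularity_score":0.4}},'
    '{"long battery":{"popularity_score":0.8}}]}'
)

# Step 1: drop the array brackets.
no_brackets = re.sub(r"(\[|\])", "", raw)

# Step 2: merge adjacent objects by turning "},{" into ",".
merged = re.sub(r"\},\{", ",", no_brackets)

print(merged)

# The result parses as a single nested JSON object.
parsed = json.loads(merged)
```

Because this edits the serialized text rather than the data, a tag that itself contains [ or ], or the sequence },{, would be corrupted, so the map_from_entries version is safer where it is available.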