Storing a string column as nested JSON in a JSON file - Pyspark

Question

I have a pyspark dataframe, and this is what it looks like:

+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid                         |Timestamp          |updated      |member_id                       |easy_id  |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|

I transformed the dataframe above into this,

+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params                                                                                                                                           |timestamp          |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile  |UPDATE   |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|

using the following code,

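from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, struct

ll = ['member_uuid', 'member_id', 'easy_id', 'field']
# col_name is defined elsewhere; it holds the operation name ('UPDATE' in the output above)
df = df.withColumn('timestamp', col('Timestamp')).withColumn('attribute', lit('profile')).withColumn('operation', lit(col_name)) \
       .withColumn('field', col('updated')).withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')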

After converting it to JSON, I save this dataframe df to a text file. I tried the following code to do that,

df_final.toJSON().coalesce(1).saveAsTextFile('file')

The file contains,

{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}

I want it saved in this format,

{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}

to_json stores the value in the params column as a string. Is there a way to keep the JSON structure here so that I can save it in the desired output format?

Answer 1

Don't use to_json to create the params column in the dataframe.

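The trick here is simply to create a struct and write it to the file (using .saveAsTextFile or .write.json()); Spark will generate the JSON for the struct field.
If we have already built the JSON object as a string and then write it out in JSON format, Spark adds backslashes (\) to escape the quotes that are already present in the JSON string.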
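The escaping described above is not specific to Spark; it happens whenever an already-serialized JSON string is serialized a second time. The following is a minimal sketch using only Python's standard json module, with a plain dict standing in for the row from the question (the variable names are illustrative and not part of the original post).

```python
import json

# Plain-Python stand-in for the params values from the question.
params = {
    "member_uuid": "027130fe-584d-4d8e-9fb0-b87c984a0c20",
    "member_id": "ajuypjtnlzmk4na047cgav27jma6_STG",
    "easy_id": 993269700,
    "field": "password_hash",
}

# Case 1: params is serialized first (the analogue of to_json),
# then the whole row is serialized again, so its quotes get escaped.
double_encoded = json.dumps({"attribute": "profile", "params": json.dumps(params)})

# Case 2: params stays a nested structure (the analogue of a struct column)
# and is serialized exactly once, so it appears as a nested JSON object.
nested = json.dumps({"attribute": "profile", "params": params})

print(double_encoded)
print(nested)
```

In the first case a reader of the file gets params back as a string that has to be parsed again; in the second it comes back as an object, which is the format the question asks for.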

Example:

from pyspark.sql.functions import lit

# sample data
df = spark.createDataFrame(
    [("027130fe-584d-4d8e-9fb0-b87c984a0c20", "2020-02-11 19:15:32", "password_hash", "ajuypjtnlzmk4na047cgav27jma6_STG", "993269700")],
    ["member_uuid", "Timestamp", "updated", "member_id", "easy_id"],
)

df1 = df.withColumn("attribute", lit("profile")).withColumn("operation", lit("UPDATE"))

# Option 1: build params as a struct and write it with the JSON writer
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").write.format("json").mode("overwrite").save("<path>")

#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}

# Option 2: the same struct, written through toJSON() and saveAsTextFile
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").toJSON().saveAsTextFile("<path>")

#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}

Answer 2

A simple way to handle it is to run a replace operation on the file

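# read the file, strip the quotes around the nested object and drop the escape backslashes
with open('file') as source:
    sourceData = source.read().replace('"{', '{').replace('}"', '}').replace('\\', '')
with open('file', 'w') as final:
    final.write(sourceData)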

This may not be what you want, but it will get you the final result.
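Plain string replacement also removes backslashes and quote/brace pairs that legitimately belong to field values. A more conservative variant of the same post-processing idea is to parse each line and re-parse only the escaped field. The sketch below assumes one JSON object per line and uses a hypothetical helper name, unnest_json_field, which does not appear in the original answers.

```python
import json

def unnest_json_field(line, key="params"):
    """Parse one JSON line and turn the string stored under `key` into a nested object."""
    record = json.loads(line)
    value = record.get(key)
    if isinstance(value, str):
        # The field holds serialized JSON; parse it so it is written back as an object.
        record[key] = json.loads(value)
    # Compact separators mirror the spacing-free layout of the file in the question.
    return json.dumps(record, separators=(",", ":"))

# The escaped line from the question, as a raw string so the backslashes are kept literally.
escaped_line = r'{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}'

fixed = unnest_json_field(escaped_line)
print(fixed)
```

Applied line by line to the saved output, this leaves every other field untouched and fails with an error, rather than silently corrupting the data, if a line is not valid JSON.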

Answer 1
1 accepted 2020-03-31 14:30:22

Answer 2
0 2020-03-31 05:45:11
