Open a JSON column as a string in a PySpark schema and work with it

Question

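I have a very large dataframe whose schema I cannot infer. One column reads as if every value were JSON, but I don't know its full structure (the keys and values can vary, and I don't know in advance what they might be).

I want to read that column as a string and work with it, but the format changes in a strange way along the way. Here is an example: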

from pyspark.sql.types import StructType

data = [{"ID": 1, "Value": {"a":12, "b": "test"}},
        {"ID": 2, "Value": {"a":13, "b": "test2"}}
        ]
df = spark.createDataFrame(data)


# Change the schema so that the Value column is read as a string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)

df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()

This gives me:

+---+---------------+
| ID|          Value|
+---+---------------+
|  1| {a=12, b=test}|
|  2|{a=13, b=test2}|
+---+---------------+

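As you can see, the values in the "Value" column no longer have quotes and use "=" instead of ":", so I can no longer work with them properly. How can I convert the column back to a StructType or MapType?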

Answer 1

Suppose this is your input dataframe:

df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])

After removing the {} from the string column, you can use the str_to_map function, as follows:

from pyspark.sql import functions as F

df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)

df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# |    |-- key: string
# |    |-- value: string (valueContainsNull = true)

df.show()
#+---+---------------------+
#|ID |Value                |
#+---+---------------------+
#|1  |{a -> 12, b -> test} |
#|2  |{a -> 13, b -> test2}|
#+---+---------------------+
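
To see what the two steps above do to a single value, here is a rough plain-Python analogue. It is only an illustrative sketch, not part of the original answer: the helper name parse_map_string is hypothetical, and it treats the delimiters as literal strings, whereas Spark's str_to_map interprets them as regular expressions.

```python
import re


def parse_map_string(text, pair_delim=", ", kv_delim="="):
    """Hypothetical helper: plain-Python analogue of regexp_replace + str_to_map."""
    # Step 1: drop the curly braces, like regexp_replace("Value", "[{}]", "")
    stripped = re.sub(r"[{}]", "", text)
    result = {}
    # Step 2: split into pairs, then split each pair at the first key/value delimiter
    for pair in stripped.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        # A pair without the key/value delimiter gets a null (None) value
        result[key] = value if sep else None
    return result


print(parse_map_string("{a=12, b=test}"))  # {'a': '12', 'b': 'test'}
```

Note that every value comes back as a string ('12', not 12), which matches the map of string to string shown by printSchema() above, so numeric values need an explicit cast afterwards. The approach also assumes that no value contains the pair delimiter ", " itself; such a value would be split into separate entries.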
