Spark - Convert a column containing JSON string from StringType to ArrayType(StringType())
I have a dataframe df that contains a JSON string like the one below:
'''[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]'''
df schema:
root
 |-- col1: string (nullable = true)
How can I convert it to an array of strings (ArrayType(StringType()))?
The result should look like this:
['{"@id":"Party_1","@ObjectID":"Policy_1"}',
'{"@id":"Party_2","@ObjectID":"Policy_2"}',
'{"@id":"Party_3","@ObjectID":"Policy_3"}']
Result schema:
root
 |-- arr_col: array (nullable = true)
 |    |-- element: string (containsNull = true)
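Any help would be greatly appreciated. Thanks!
You can use the from_json function to parse the JSON fields after a small modification to the value (wrapping the top-level array in an object under the key arr), as shown below: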
from pyspark.sql import functions as f

# Assumes an active SparkSession named `spark`.
data = [
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 2767),
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 4235),
]

# Wrap the top-level JSON array in an object: [{...}] becomes {"arr": [{...}]}.
df = (
    spark.createDataFrame(data, ["value", "count"])
    .withColumn("value", f.regexp_replace(f.col("value"), r"\[\{", '{"arr": [{'))
    .withColumn("value", f.regexp_replace(f.col("value"), r"\}\]", "}]}"))
)

# Infer the schema from the wrapped JSON strings, then parse with from_json.
json_schema = spark.read.json(df.rdd.map(lambda row: row.value)).schema
resultDF = df.select(f.from_json("value", schema=json_schema).alias("array_col")).select("array_col.*")

resultDF.printSchema()
resultDF.show(truncate=False)
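Alternatively, if you want each nested JSON object kept as a string, you can pass a custom schema to from_json instead of the inferred one.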
Output schema (for the inferred-schema code above):
root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @ObjectID: string (nullable = true)
 |    |    |-- @id: string (nullable = true)
Output:
+---------------------------------------------------------------+
|arr |
+---------------------------------------------------------------+
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
+---------------------------------------------------------------+
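For the string-array variant mentioned above, here is a minimal sketch rather than a tested recipe. It assumes a Spark version whose from_json accepts an array schema at the top level (2.4 onward, as far as I know), and that a string element type makes from_json keep each object as JSON text. The names split_json_array and add_array_column are hypothetical helpers introduced only for this illustration; the plain-Python function shows the per-row result you would expect to see in arr_col.

```python
import json


def split_json_array(raw):
    """Split a JSON array string into one compact JSON string per element."""
    if raw is None:
        return None
    # Compact separators reproduce the spacing used in the question's data.
    return [json.dumps(item, separators=(",", ":")) for item in json.loads(raw)]


def add_array_column(df, source="col1", target="arr_col"):
    """Hypothetical helper: parse `source` using a custom array<string> schema."""
    # Imported lazily so split_json_array stays usable without Spark installed.
    from pyspark.sql import functions as f
    from pyspark.sql.types import ArrayType, StringType

    # Assumption: with a string element type, each object is kept as JSON text
    # instead of being expanded into a struct.
    return df.withColumn(target, f.from_json(f.col(source), ArrayType(StringType())))


if __name__ == "__main__":
    row = '[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"}]'
    for element in split_json_array(row):
        print(element)
```

If from_json does not behave this way on your version, split_json_array could be wrapped as a UDF with f.udf(split_json_array, ArrayType(StringType())), which should give the same schema at the cost of Python serialization overhead.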