
Split JSON string column to multiple columns without schema - PySpark

I have a delta table with a column containing JSON data. I do not have a schema for it and need a way to convert the JSON data into columns:

|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}

Expected output:

|id | name    | depts              | sal | address_city 
| 1 | "abc"   | ["dep01", "dep02"] | null| null         
| 2 | "xyz"   | ["dep03"]          | 100 | null         
| 3 | "pqr"   | ["dep02"]          | null| "SF"        

Input Dataframe -

df = spark.createDataFrame(
    data=[
        (1, """{"name":"abc", "depts":["dep01", "dep02"]}"""),
        (2, """{"name":"xyz", "depts":["dep03"],"sal":100}"""),
        (3, """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}"""),
    ],
    schema=["id", "json_data"],
)
df.show(truncate=False)

+---+----------------------------------------------------------+
|id |json_data                                                 |
+---+----------------------------------------------------------+
|1  |{"name":"abc", "depts":["dep01", "dep02"]}                |
|2  |{"name":"xyz", "depts":["dep03"],"sal":100}               |
|3  |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+

Convert the json_data column to MapType as below -

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Parse each top-level JSON key into a map entry; nested values remain raw JSON strings
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)

+---+-----------------------------------------------------------+
|id |cols                                                       |
+---+-----------------------------------------------------------+
|1  |{name -> abc, depts -> ["dep01","dep02"]}                  |
|2  |{name -> xyz, depts -> ["dep03"], sal -> 100}              |
|3  |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+

Now, the column cols needs to be exploded as below -

df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)

+---+-----------+-----------------+
|id |col_columns|col_rows         |
+---+-----------+-----------------+
|1  |name       |abc              |
|1  |depts      |["dep01","dep02"]|
|2  |name       |xyz              |
|2  |depts      |["dep03"]        |
|2  |sal        |100              |
|3  |name       |pqr              |
|3  |depts      |["dep02"]        |
|3  |address    |{"city":"SF"}    |
+---+-----------+-----------------+

Once you have col_columns and col_rows as separate columns, all that needs to be done is to pivot col_columns and aggregate it using its corresponding first col_rows, as below -

df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)

+---+-------------+-----------------+----+----+
|id |address      |depts            |name|sal |
+---+-------------+-----------------+----+----+
|1  |null         |["dep01","dep02"]|abc |null|
|2  |null         |["dep03"]        |xyz |100 |
|3  |{"city":"SF"}|["dep02"]        |pqr |null|
+---+-------------+-----------------+----+----+

Finally, you need to repeat the above steps once more to bring address into a structured format, as below -

df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)

+---+-----------------+----+----+------------+
|id |depts            |name|sal |address_city|
+---+-----------------+----+----+------------+
|1  |["dep01","dep02"]|abc |null|null        |
|2  |["dep03"]        |xyz |100 |null        |
|3  |["dep02"]        |pqr |null|SF          |
+---+-----------------+----+----+------------+
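For reference, the individual steps above can also be chained into a single pipeline. This is only a minimal sketch that reuses the same df, column names, and functions already shown, and it assumes (as in the expected output) that only the city key of the nested address object is needed:

from pyspark.sql.functions import from_json, explode, explode_outer, first
from pyspark.sql.types import MapType, StringType

result = (
    df
    # parse top-level keys into a map, then explode it into key/value rows
    .withColumn("cols", from_json("json_data", MapType(StringType(), StringType())))
    .select("id", explode("cols").alias("col_columns", "col_rows"))
    # pivot the keys back into columns, keeping the first value per key
    .groupBy("id").pivot("col_columns").agg(first("col_rows"))
    # repeat the same trick for the nested address JSON string
    .withColumn("address", from_json("address", MapType(StringType(), StringType())))
    .select("id", "depts", "name", "sal",
            explode_outer("address").alias("key", "address_city"))
    .drop("key")
)
result.show(truncate=False)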

To solve it, you can use the split function as in the code below.

The function takes two arguments: the first is the column itself, and the second is the pattern on which to split the elements of the column into an array.

More information and examples can be found here:

https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType

from pyspark.sql import functions as F

# split the string column on commas into an array column
df.select(F.split(F.col('depts'), ','))
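As a minimal, self-contained sketch of what split does (the one-row demo DataFrame and its depts column below are made up purely for illustration, and an active spark session is assumed):

from pyspark.sql import functions as F

# hypothetical one-row DataFrame with a comma-separated string column
demo = spark.createDataFrame([("dep01,dep02",)], ["depts"])

# split turns the string into an array column based on the given pattern
demo.select(F.split(F.col("depts"), ",").alias("depts_array")).show(truncate=False)

# +--------------+
# |depts_array   |
# +--------------+
# |[dep01, dep02]|
# +--------------+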

To dynamically parse a JSON string column and promote its attributes without a known schema, I'm afraid you cannot use PySpark; it can be done with Scala.

For example, when you have some Avro files produced by Kafka and you want to be able to dynamically parse the Value that holds the serialized JSON string:

var df = spark.read.format("avro").load("abfss://abc@def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)

The key is spark.read.json(df.as[String]) in Scala, which basically:

  1. converts that DF (in this case it only has one column we are interested in; you can of course handle multiple columns of interest similarly and merge whatever columns you want) into a String
  2. parses the JSON string using the standard Spark read options, which does not require a schema.

So far, as far as I know, there is no equivalent method exposed to PySpark.
