Nested JSON to TSV in Databricks with PySpark

Question

I want to convert nested JSON to TSV using PySpark in a Databricks notebook.

Below is the JSON structure. The columns can change.

{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}

I am new to Databricks, please help.

Answer 1

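There are two ways to handle this. You can do some pre-processing in Python with the json library (or an equivalent), or load the JSON directly into PySpark and do the following: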

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""

# load in directly using read.json(), you'll see that this becomes 
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
 |-- tables: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- columns: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- rows: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: string (containsNull = true)


# select nested columns "tables" and "rows" and explode
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))

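explode takes the ArrayType column rows and splits it into actual rows, one per inner array. You can then sub-select individual values with dot or index notation: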

array_df.printSchema()
root
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)


tabular_df = array_df.select(
  array_df.col[0].alias("JobTime"), 
  array_df.col[1].alias("Status")
)
tabular_df.show()

+--------------------+------+
|             JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+

Finally, you want to save the result as CSV with a custom delimiter (\t). So:

tabular_df.write.csv("path/to/file.tsv", sep="\t")

Note: you may need to control the types manually, for example by casting JobTime to TimestampType, but I will leave that up to you. Hope this helps.
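The other route mentioned at the start of this answer, pre-processing in plain Python with the json library, could look roughly like the sketch below. It is only an illustration for payloads small enough to fit in driver memory. The function name tables_json_to_tsv and its table_index parameter are hypothetical, and the sample is a shortened two-row copy of the question's JSON. The header is taken from the columns metadata, so it follows the columns if they change.

```python
import csv
import io
import json


def tables_json_to_tsv(raw_json, table_index=0):
    """Return one table of the nested JSON document as TSV text."""
    table = json.loads(raw_json)["tables"][table_index]
    # Header names come from the "columns" metadata, not from hardcoded values.
    header = [column["name"] for column in table["columns"]]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(table["rows"])
    return buffer.getvalue()


# Shortened copy of the question's JSON (two rows only), for illustration.
sample_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
]}
]}
"""

tsv_text = tables_json_to_tsv(sample_json)
print(tsv_text)
```

To write a file instead of a string, the same csv.writer call can be given a file handle opened with newline="".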

Solution 1
0 2020-05-14 08:30:47
