
AWS Glue Dynamic Frame columns from array

I have a nested JSON, structured as in the following example: {'A':[{'key':'B','value':'C'},{'key':'D','value':'E'}]}. I want to map this to the following schema:

|--A 
|--|--B
|--|--D

That is, the structure you would recover from a JSON like:

{'A':{'B':'C','D':'E'}}

The array in 'A' has no fixed number of entries, but the contained dicts always have exactly the two keys 'key' and 'value'.
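To make the target shape concrete, here is a minimal plain-Python sketch of the mapping described above, applied to a single record. The function name `entries_to_struct` and its `field` parameter are hypothetical and not part of Glue or Spark. It assumes every entry carries both 'key' and 'value', as stated in the question.

```python
def entries_to_struct(record, field="A"):
    """Collapse a list of {'key': k, 'value': v} dicts under `field` into one dict."""
    # A missing or null field is treated as an empty list (an assumption).
    entries = record.get(field) or []
    # If a key appears more than once, the last occurrence wins.
    collapsed = {entry["key"]: entry["value"] for entry in entries}
    # Return a copy of the record with the field replaced.
    return {**record, field: collapsed}


record = {"A": [{"key": "B", "value": "C"}, {"key": "D", "value": "E"}]}
print(entries_to_struct(record))  # {'A': {'B': 'C', 'D': 'E'}}
```

A per-record function of this shape might also be usable with a record-level transform such as Glue's `Map`, though that is untested here.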

Please find the script below. It explodes the array, turns each key/value struct into a map, and collects the maps back into an array.

>>> from pyspark.sql.functions import explode, create_map, collect_list

>>> sample.printSchema()
root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)


>>> final_df = (sample
...             .select(explode('A').alias("A"))
...             .withColumn("A",create_map("A.key", "A.value"))
...             .groupby().agg(collect_list("A").alias("A"))
... )
>>> final_df.printSchema()
root
 |-- A: array (nullable = true)
 |    |-- element: map (containsNull = false)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

>>> final_df.show(truncate=False)
+--------------------+
|A                   |
+--------------------+
|[[B -> C], [D -> E]]|
+--------------------+

>>> (final_df
...  .write
...  .format("json")
...  .mode("overwrite")
...  .save("sample_files/2020-09-29/out")
... )
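Note that, per the schema printed above, column A in `final_df` is an array of single-entry maps rather than one struct with fields B and D. If the single-object shape from the question is needed, one option is to merge the maps after the fact. The sketch below does this in plain Python for one line of output. The helper name `merge_maps` is hypothetical, and the sample line assumes Spark writes the row as `{"A": [{"B": "C"}, {"D": "E"}]}`.

```python
import json


def merge_maps(line, field="A"):
    """Merge an array of single-entry objects under `field` into one object."""
    row = json.loads(line)
    merged = {}
    # Each element is assumed to be a small dict such as {"B": "C"}.
    for single in row.get(field) or []:
        merged.update(single)
    row[field] = merged
    return row


# Assumed shape of one line written by the JSON writer.
line = '{"A": [{"B": "C"}, {"D": "E"}]}'
print(json.dumps(merge_maps(line)))  # {"A": {"B": "C", "D": "E"}}
```

Within Spark itself, the built-in `map_from_entries` function (Spark 2.4 and later) may achieve a similar result directly on the original column without the explode and collect steps. That route is not verified against this data.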
