Flatten pyspark nested structure - Pyspark
I want to flatten a nested column into separate columns containing only a few specific values.
First off, I start with an AVRO capture file from Event Hub. Once I convert that to a dataframe in Python, I receive the following column after removing the irrelevant ones.
This column has the following structure.
What I want to do next is flatten this column, keeping only specific values. I can get this done for one cell, but because I am dealing with a nested structure, the column is not iterable.
Can anyone help me out?
You have 2 data fields in your schema, so I'd call the first one data_struct and the second one data_array. If you do avro_df_body.select('data_struct.data_array'), you'd have an ArrayType column, to which you can apply the explode function to break it down into multiple rows.
from pyspark.sql import functions as F

(avro_df_body
 .withColumn('tmp', F.explode('data_struct.data_array'))  # one row per array element
 .withColumn('speed', F.col('tmp.a'))                     # pull field a out of the exploded struct
 .withColumn('timestamp', F.col('tmp.time'))              # pull field time out of the exploded struct
 .show()
)
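If it helps to see the shape transformation outside Spark, here is a plain-Python sketch of what the explode-then-select step does to one nested row. The field values below are made-up placeholders, since the actual data isn't shown in the question; only the field names (data_struct, data_array, a, time) come from the schema above:

```python
# Hypothetical row shaped like the schema in the question:
# a struct holding an array of structs with fields `a` and `time`.
rows = [
    {"data_struct": {"data_array": [
        {"a": 12.5, "time": "2023-01-01T00:00:00"},
        {"a": 13.1, "time": "2023-01-01T00:00:10"},
    ]}},
]

# explode: one output row per array element;
# the struct-field selection then renames a -> speed, time -> timestamp.
flattened = [
    {"speed": item["a"], "timestamp": item["time"]}
    for row in rows
    for item in row["data_struct"]["data_array"]
]
# flattened now holds two flat rows instead of one nested row.
```

This is only an illustration of the row-multiplying semantics; in Spark the same thing happens lazily and distributed across partitions.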