Flattening a nested PySpark structure - Pyspark

Question

I want to flatten a nested column into separate columns that contain only a few specific values.

First off, I start with an AVRO capture file from Event Hub. Once I convert that to a dataframe in Python, I get the following column after removing the irrelevant ones.

This column has the following structure.

What I want to do next is flatten this column, keeping only specific values.

I can get this done for one cell, but because I am dealing with a nested structure, the column is not iterable.

Can anyone help me out?

Answer 1

You have two data fields in your schema, so I'll call the first one data_struct and the second one data_array. If you run avro_df_body.select('data_struct.data_array'), you get an ArrayType column, to which you can apply the explode function to break it down into multiple rows.

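from pyspark.sql import functions as F

(avro_df_body
    # explode turns each element of the array into its own row
    .withColumn('tmp', F.explode('data_struct.data_array'))
    # pull the wanted fields out of the exploded struct
    .withColumn('speed', F.col('tmp.a'))
    .withColumn('timestamp', F.col('tmp.time'))
    .show()
)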

Solution 1
0 2021-05-24 03:32:07
