
Exploding JSON array in Pyspark when source changes frequently

Below is my input schema.

root
 |-- Date: string (nullable = true)
 |-- tax: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- hst: double (nullable = true)
 |    |    |-- gst: double (nullable = true)

My required output is as below.

+-----------+---------+------------------+----------------------+
|       DATE|       ID|               hst|                   gst|
+-----------+---------+------------------+----------------------+
|14-Jun-2022|     1   |             100.0|                  12.0|
|14-Jun-2022|     2   |              80.0|                   6.0|
+-----------+---------+------------------+----------------------+
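
To make the examples reproducible, here is a minimal sketch that builds a sourceDF matching the schema and the sample values above (the name sourceDF matches the code later in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row: a date plus two tax entries, matching the desired output above.
data = [("14-Jun-2022", [("1", 100.0, 12.0), ("2", 80.0, 6.0)])]
sourceDF = spark.createDataFrame(
    data, "Date string, tax array<struct<ID: string, hst: double, gst: double>>"
)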

I am using the code below to explode the array and select the columns from the exploded result.

from pyspark.sql.functions import col, explode

# Explode the tax array into one row per struct, then pull out each field.
df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = df.select(col("DATE"), col("exploded_column.ID"), col("exploded_column.hst"), col("exploded_column.gst"))

This gives me the required output, but I want to know whether there is another way to achieve it when my source DataFrame's schema changes frequently.

Thank you.

If the exploded column remains a struct, you can use ".*" to select all of its fields.

from pyspark.sql.functions import col, explode

df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = df.select(col("DATE"), col("exploded_column.*"))
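
With the schema above, "exploded_column.*" expands to ID, hst, and gst automatically, so any fields later added to the tax struct will show up without code changes.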

You can also dynamically generate the varargs to pass to select, as shown below. This is similar to "*", but it can be combined with a filter to exclude columns you do not want to include.

# Select DATE plus every remaining column, skipping any you filter out.
df = sourceDF.select(col("DATE"), *[col(column) for column in sourceDF.columns if column != "DATE"])
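
The list comprehension builds a Python list of Column objects, and the leading * unpacks it so that select receives them as individual arguments; the filter condition is where you can drop unwanted columns.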

I haven't tried combining the two, but it might be possible; a sketch of one way to do it follows.
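
For illustration, here is a hedged sketch combining both ideas: explode, expand the struct with ".*", then re-select dynamically while excluding a field ("gst" is a hypothetical exclusion, purely for illustration):

from pyspark.sql.functions import col, explode

# Explode the array, then expand every struct field with ".*".
exploded = sourceDF.withColumn("exploded_column", explode(col("tax")))
expanded = exploded.select(col("DATE"), col("exploded_column.*"))

# Dynamically keep everything except the fields you want to drop.
excluded = {"gst"}
result = expanded.select(*[col(c) for c in expanded.columns if c not in excluded])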
