
Rename Spark DataFrame StructType fields

Given a DataFrame with a struct column (StructType) whose name is not known in advance: the name is dynamic and changes between inputs.

Because the name is variable, do not assume "MAIN_COL" in the schema; it appears below only as an example.

root
 |-- MAIN_COL: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: string (nullable = true)
 |    |-- c: string (nullable = true)
 |    |-- d: string (nullable = true)
 |    |-- f: long (nullable = true)
 |    |-- g: long (nullable = true)
 |    |-- h: long (nullable = true)
 |    |-- j: long (nullable = true)

How can we write dynamic code that renames the fields of the struct, using the struct column's own name as the prefix?

root
 |-- MAIN_COL: struct (nullable = true)
 |    |-- MAIN_COL_a: string (nullable = true)
 |    |-- MAIN_COL_b: string (nullable = true)
 |    |-- MAIN_COL_c: string (nullable = true)
 |    |-- MAIN_COL_d: string (nullable = true)
 |    |-- MAIN_COL_f: long (nullable = true)
 |    |-- MAIN_COL_g: long (nullable = true)
 |    |-- MAIN_COL_h: long (nullable = true)
 |    |-- MAIN_COL_j: long (nullable = true)

You can use the DataFrame DSL to update the schema of a nested column: build a new StructType with the renamed fields and cast the struct column to it.

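import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// The struct column's name is not known in advance, so read it from the schema.
// This assumes the struct is the first column of df.
val structField = df.schema.fields.head
val colName = structField.name
val schema: StructType = structField.dataType.asInstanceOf[StructType]

// Prefix every nested field name with the struct column's name.
val updatedSchema = StructType(
  schema.fields.map(sf => sf.copy(name = s"${colName}_${sf.name}"))
)

// Casting a struct to a struct with the same field types only renames the fields.
val resultDF = df.withColumn(colName, col(colName).cast(updatedSchema))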

Updated Schema:

root
 |-- MAIN_COL: struct (nullable = false)
 |    |-- MAIN_COL_a: string (nullable = true)
 |    |-- MAIN_COL_b: string (nullable = true)
 |    |-- MAIN_COL_c: string (nullable = true)
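
If the DataFrame may hold several struct columns, or the struct is not the first column, the same cast-based rename can be applied to each top-level struct in turn. The following is only a minimal sketch under those assumptions; the helper names prefixFields and prefixAllStructs are hypothetical and not part of Spark, and structs nested inside other structs are not descended into.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: prefix each nested field name with the parent column's
// name, keeping the field's data type, nullability and metadata.
def prefixFields(parent: String, struct: StructType): StructType =
  StructType(struct.fields.map(f => f.copy(name = s"${parent}_${f.name}")))

// Hypothetical helper: apply prefixFields to every top-level struct column.
// Non-struct columns are left unchanged.
def prefixAllStructs(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) {
    case (acc, StructField(name, st: StructType, _, _)) =>
      acc.withColumn(name, acc(name).cast(prefixFields(name, st)))
    case (acc, _) => acc
  }
```

Calling prefixAllStructs(df).printSchema() should then show each struct's fields carrying that struct's column name as a prefix. Keeping the schema transformation in a separate pure function means it can be checked without a SparkSession.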
