
Update Schema for DataFrame in Apache Spark

I have a DataFrame with the following schema

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c_a: string (nullable = false)
 |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

Now I want to convert the schema of this DataFrame to something like this:

root
 |-- col_a: string (nullable = false)
 |-- col_b: string (nullable = false)
 |-- col_c: struct (nullable = false)
 |    |-- col_c_a: string (nullable = false)
 |    |-- col_c_b: string (nullable = false)
 |-- col_d: string (nullable = false)
 |-- col_e: string (nullable = false)
 |-- col_f: string (nullable = false)

I can do this with a map transformation by explicitly fetching the value of each column from the Row type, for example something like the sketch below, but that is a very verbose process and does not look clean.
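
For illustration, the Row-based approach I mean looks roughly like this (a sketch that assumes a SparkSession named spark and the flat DataFrame named df):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Rebuild every row by hand and supply the nested schema explicitly.
val nestedSchema = StructType(Seq(
  StructField("col_a", StringType, nullable = false),
  StructField("col_b", StringType, nullable = false),
  StructField("col_c", StructType(Seq(
    StructField("col_c_a", StringType, nullable = false),
    StructField("col_c_b", StringType, nullable = false)
  )), nullable = false),
  StructField("col_d", StringType, nullable = false),
  StructField("col_e", StringType, nullable = false),
  StructField("col_f", StringType, nullable = false)
))

// Copy each flat column into a new Row, nesting col_c_a and col_c_b inside an inner Row.
val nestedRows = df.rdd.map { row =>
  Row(
    row.getAs[String]("col_a"),
    row.getAs[String]("col_b"),
    Row(row.getAs[String]("col_c_a"), row.getAs[String]("col_c_b")),
    row.getAs[String]("col_d"),
    row.getAs[String]("col_e"),
    row.getAs[String]("col_f")
  )
}

val nestedDf = spark.createDataFrame(nestedRows, nestedSchema)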

Is there a simpler way to achieve this?

Thanks

There is a built-in struct function with the definition:

def struct(cols: Column*): Column

You can use it like this:

df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  3|
+---+---+

df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
|  a|  b|struct_col|
+---+---+----------+
|  1|  2|     [1,2]|
|  2|  3|     [2,3]|
+---+---+----------+

The schema of the new DataFrame is:

 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- struct_col: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)

In your case, you can do something like:

df.withColumn("col_c" , struct($"col_c_a", $"col_c_b") ).drop($"col_c_a").drop($"col_c_b")
