
How to change the datatype of a column in StructField of a StructType?

I am trying to change the datatype of a column present in a dataframe that I am reading from an RDBMS database. To do that, I got the schema of the dataframe as below:

val dataSchema = dataDF.schema

To see the schema of the dataframe, I used the below statement:

println(dataSchema)

Output: StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DecimalType(15,0),true), StructField(creation_date,TimestampType,true), StructField(created_by,DecimalType(15,0),true), StructField(created_by_name,StringType,true), StructField(entered_dr,DecimalType(38,30),true), StructField(entered_cr,DecimalType(38,30),true))

My requirement is to find the DecimalType columns in the above schema and change them to DoubleType. I can get the column names and datatypes using dataDF.dtypes, but it gives me the datatypes in the format ((columnName1, column datatype), (columnName2, column datatype), ..., (columnNameN, column datatype)).

I have been trying to find a way to parse the StructType and change the schema in dataSchema, but in vain.

Could anyone let me know if there is a way to parse the StructType so that I can change the datatypes as required and get a schema in the below format:

StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DoubleType,true), StructField(creation_date,TimestampType,true), StructField(created_by,DoubleType,true), StructField(created_by_name,StringType,true), StructField(entered_dr,DoubleType,true), StructField(entered_cr,DoubleType,true))

To modify a DataFrame schema for a specific data type, you can pattern-match against the StructField's dataType, as shown below:

import org.apache.spark.sql.types._
import spark.implicits._  // for toDF; already in scope in spark-shell

val df = Seq(
  (1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
  (2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")

// Map DecimalType fields to DoubleType; leave all other fields untouched
val newSchema = df.schema.fields.map{
  case StructField(name, _: DecimalType, nullable, _)
    => StructField(name, DoubleType, nullable)
  case field => field
}
// newSchema: Array[org.apache.spark.sql.types.StructField] = Array(
//   StructField(c1,LongType,false), StructField(c2,DoubleType,true),
//   StructField(c3,StringType,true), StructField(c4,DoubleType,true)
// )
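If you also want the mapped result printed in the StructType(StructField(...), ...) form asked for in the question, a quick sketch (the name newStructType is arbitrary) is to wrap the mapped fields back into a StructType:

// Wrap the mapped fields back into a StructType; printing it should give roughly the requested format
val newStructType = StructType(newSchema)
println(newStructType)
// StructType(StructField(c1,LongType,false), StructField(c2,DoubleType,true),
//            StructField(c3,StringType,true), StructField(c4,DoubleType,true))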

However, assuming your end goal is to transform the dataset with the column type change, it would be easier to just traverse the columns of the targeted data type and cast them iteratively, as below:

import org.apache.spark.sql.functions._

// Collect the names of all DecimalType columns, then cast each one to Double
val df2 = df.dtypes.
  collect{ case (dn, dt) if dt.startsWith("DecimalType") => dn }.
  foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))

df2.printSchema
// root
//  |-- c1: long (nullable = false)
//  |-- c2: double (nullable = true)
//  |-- c3: string (nullable = true)
//  |-- c4: double (nullable = true)

[UPDATE]

Per an additional requirement from the comments: if you want to change the schema only for DecimalType columns with a positive scale, just apply a Regex pattern match as the guard condition in the collect method:

val pattern = """DecimalType\(\d+,(\d+)\)""".r

val df2 = df.dtypes.
  collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
  foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
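As a quick sanity check (not part of the original answer), here is what the pattern captures for the two decimal flavours in the question's schema; only a non-zero captured scale triggers the cast:

pattern.findFirstMatchIn("DecimalType(15,0)").map(_.group(1))   // Some("0")  -> left as DecimalType
pattern.findFirstMatchIn("DecimalType(38,30)").map(_.group(1))  // Some("30") -> cast to Double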

Here is another way:

data.show(false)
data.printSchema

+----+------------------------+----+----------------------+
|col1|col2                    |col3|col4                  |
+----+------------------------+----+----------------------+
|1   |0.003200000000000000    |a   |23.320000000000000000 |
|2   |78787.990030000000000000|c   |343.320000000000000000|
+----+------------------------+----+----------------------+

root
 |-- col1: integer (nullable = false)
 |-- col2: decimal(38,18) (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: decimal(38,18) (nullable = true) 

Create the schema that you want. Example:

val newSchema = StructType(
  Seq(
    StructField("col1", StringType, true),
    StructField("col2", DoubleType, true),
    StructField("col3", StringType, true),
    StructField("col4", DoubleType, true)
  )
)
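Note that newSchema is hand-written above. If you would rather derive it from the existing schema (closer to what the question asks), a sketch could be the following; unlike the literal version, it only changes the decimal columns and leaves col1 as an integer:

// Alternative to the hand-written definition: map every DecimalType field to DoubleType
val newSchema = StructType(data.schema.map {
  case StructField(name, _: DecimalType, nullable, metadata) =>
    StructField(name, DoubleType, nullable, metadata)
  case other => other
})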

Then cast the columns to the required data type using selectExpr:

val newDF = data.selectExpr(newSchema.map(
  col => s"CAST ( ${col.name} As ${col.dataType.sql}) ${col.name}"
): _*)

newDF.printSchema

root
 |-- col1: string (nullable = false)
 |-- col2: double (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: double (nullable = true) 

newDF.show(false)
+----+-----------+----+------+
|col1|col2       |col3|col4  |
+----+-----------+----+------+
|1   |0.0032     |a   |23.32 |
|2   |78787.99003|c   |343.32|
+----+-----------+----+------+
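For reference, the same casts can also be written with the Column API instead of SQL strings; newDF2 below is just an illustrative name:

import org.apache.spark.sql.functions.col

// Equivalent to the selectExpr above, using Column.cast instead of SQL CAST strings
val newDF2 = data.select(newSchema.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
newDF2.printSchema   // same schema as newDF above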

The accepted solution works great, but it is very costly because of the huge cost of withColumn: the analyzer has to re-analyze the whole DataFrame for every withColumn call, and with a large number of columns this becomes very expensive. I would rather suggest doing this:

// Build a single projection: cast DecimalType columns to Double, keep every other column as-is
val transformedColumns = inputDataDF.dtypes.map {
  case (dn, dt) if dt.startsWith("DecimalType") => inputDataDF(dn).cast(DoubleType).as(dn)
  case (dn, _)                                  => inputDataDF(dn)
}

// A single select triggers only one analysis pass, unlike repeated withColumn calls
val transformedDF = inputDataDF.select(transformedColumns: _*)

For a fairly small dataset, the withColumn approach took over a minute on my machine, while the select approach took about 100 ms.

You can read more about the cost of withColumn here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
