
Dynamic conversion of array-of-double columns into multiple columns in a nested Spark DataFrame

My current DataFrame looks like this:

{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}

I want to transform it into the DataFrame below:

{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}

This means that each field name of the 'values' struct (0.2, 0.4 and 0.6) is multiplied by 100, prefixed with the letter 'v', and extracted into a separate column.
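The renaming rule by itself (multiply the key by 100, prepend 'v') can be sketched in plain Scala, independent of Spark; `RenameRule` and `toColumnName` here are illustrative names, not part of any answer below:

```scala
object RenameRule {
  // Turn a fractional key like "0.2" into "v20": multiply by 100, prepend "v".
  def toColumnName(key: String): String =
    "v" + (key.toDouble * 100).toInt

  def main(args: Array[String]): Unit = {
    val keys = Seq("0.2", "0.4", "0.6")
    keys.foreach(k => println(s"$k -> ${toColumnName(k)}"))  // 0.2 -> v20, etc.
  }
}
```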

What would the code look like to achieve this? I have tried withColumn but couldn't make it work.

Try the code below; see the inline comments for an explanation of each step.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Fold over the nested fields
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
    }).drop("inputs") // Drop the extra column

    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }

}

I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that stay unchanged.

def stringDecimalToVNumber(colName:String): String =
  "v" + (colName.toFloat * 100).toInt.toString

and then form a single function that applies the right transformation for each case:

val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as-is
}

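As a self-contained sketch (plain Scala, no Spark; the wrapping `NameTransform` object is only for illustration), the two functions behave like this on sample names:

```scala
object NameTransform {
  val floatRegex = """(\d+\.?\d*)""".r

  def stringDecimalToVNumber(colName: String): String =
    "v" + (colName.toFloat * 100).toInt.toString

  // Fully numeric names get the "v" prefix treatment; everything else
  // (e.g. "id1", which only partially matches the regex) is unchanged.
  def transformColumnName(colName: String): String = colName match {
    case floatRegex(v) => stringDecimalToVNumber(v)
    case x             => x
  }
}
```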
Now that we have the function to transform the column names, let's pick up the schema dynamically.

val flattenDF = df.select("id","inputs.values.*")

val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })

This will dynamically rename all the columns that came from inputs.values and place them next to id.
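The end-to-end effect of the rename, ignoring Spark and using a plain Map to stand in for the 'values' struct, would look roughly like this (`EndToEnd` and `renameKeys` are illustrative names):

```scala
object EndToEnd {
  // Rename each fractional key to its "v"-prefixed integer form,
  // keeping the associated arrays untouched.
  def renameKeys(values: Map[String, Seq[Int]]): Map[String, Seq[Int]] =
    values.map { case (k, v) => ("v" + math.round(k.toDouble * 100), v) }
}
```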

