Dynamic conversion of an array of double columns into multiple columns in a nested Spark DataFrame

My current DataFrame looks like this:


I want to transform this DataFrame into the following DataFrame:

{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}

In other words, each key of the 'values' struct (0.2, 0.4 and 0.6) is multiplied by 100, prefixed with the letter 'v', and extracted into its own column.

What would the code look like to achieve this? I have tried withColumn but could not get it to work.

Try the code below; the inline comments explain each step.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temporary DataFrame exposing the nested struct
    val index = dfTemp.schema.fieldIndex("values") // Position of the "values" field in the schema
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType] // Schema of the nested struct
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Add one column per nested field
      val colNameInt = (field.name.toDouble * 100).toInt // e.g. "0.2" becomes 20
      val colName = s"v$colNameInt"
      // Backticks are required because the nested field names contain dots
      df.withColumn(colName, col("inputs.values.`" + field.name + "`"))
    }).drop("inputs") // Drop the original nested column

    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Write the result as JSON
  }
}
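
One caveat about the name computation: `(name.toDouble * 100).toInt` truncates, and binary floating point can land just below the intended integer. The keys in the question (0.2, 0.4, 0.6) are not affected, but a key such as 0.29 would be. The following is a minimal sketch of an alternative that uses exact decimal arithmetic; the object and method names are hypothetical and not part of the answer above.

```scala
// Hypothetical helper names, used only to compare the two approaches.
object ColumnNameRounding {
  // Truncating variant, mirroring the Double arithmetic used in the answer.
  def truncated(name: String): String = "v" + (name.toDouble * 100).toInt

  // Exact variant: BigDecimal parses the decimal string without binary rounding error.
  def exact(name: String): String = "v" + (BigDecimal(name) * 100).toInt

  def main(args: Array[String]): Unit = {
    for (n <- Seq("0.2", "0.4", "0.6", "0.29"))
      println(s"$n -> truncated=${truncated(n)}, exact=${exact(n)}")
  }
}
```

For "0.29" the truncating variant yields v28 (because 0.29 * 100 evaluates to 28.999999999999996 as a Double), while the exact variant yields v29.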


I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that stay unchanged.

def stringDecimalToVNumber(colName:String): String =
  "v" + (colName.toFloat * 100).toInt.toString

Then combine them into a single function that transforms the name depending on which case applies:

val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // the whole name is numeric, so transform it
  case x => x // anything else is kept as-is
}
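
To check how this function behaves, here is a small self-contained sketch that applies it to a few names without needing Spark. The wrapper object and the sample names are assumptions for illustration; the two functions are the ones defined above.

```scala
object TransformColumnNameDemo {
  def stringDecimalToVNumber(colName: String): String =
    "v" + (colName.toFloat * 100).toInt.toString

  // Used as an extractor, the regex must match the entire name.
  val floatRegex = """(\d+\.?\d*)""".r

  def transformColumnName(colName: String): String = colName match {
    case floatRegex(v) => stringDecimalToVNumber(v) // fully numeric name
    case other         => other                     // e.g. "id" or "id1"
  }

  def main(args: Array[String]): Unit =
    Seq("0.2", "0.4", "0.6", "id", "id1", "20").foreach { n =>
      println(s"$n -> ${transformColumnName(n)}")
    }
}
```

Note that the regex also matches integer-only names, so a column called "20" would become "v2000". If such columns can occur, the pattern may need a mandatory decimal point.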

Now that we have a function to transform the column names, let's pick up the schema dynamically.

val flattenDF = df.select("id","inputs.values.*")

val finalDF = flattenDF.columns
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })

This dynamically renames every column found inside inputs.values and places the results next to id.
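
As a variation on the same idea, the renamed columns can also be built in a single select using aliases instead of folding over withColumnRenamed. This is only a sketch under stated assumptions: the input JSON shape is inferred from the field paths used in the answers (the original sample is not shown), the object and method names are hypothetical, every key under inputs.values is assumed to be numeric, and `math.round` is swapped in for the truncation used above. The id1 column is left out because its location in the input is not shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FlattenValues {
  // Hypothetical helper: aliases each nested field as v<value * 100> in one select.
  def flatten(df: DataFrame): DataFrame = {
    // Field names of the nested struct, e.g. "0.2", "0.4", "0.6"
    val valueNames = df.select("inputs.values.*").columns
    val renamed = valueNames.map { name =>
      // Backticks are needed because the field names contain dots
      col(s"inputs.values.`$name`").as("v" + math.round(name.toDouble * 100))
    }
    df.select((col("id") +: renamed): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("flatten-values").getOrCreate()
    import spark.implicits._

    // ASSUMED input shape; the question's original sample is not available.
    val sample = Seq("""{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}}}""")
    val df = spark.read.json(sample.toDS())

    flatten(df).show()
    spark.stop()
  }
}
```

Building all aliases in one select keeps the query plan to a single projection, which tends to matter more as the number of nested fields grows.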

