
Dynamic conversion of Array of double columns into multiple columns in nested Spark DataFrame

My current DataFrame looks like this:

{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}

I want to transform it into the following DataFrame:

{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}

This means that each key of the 'values' struct (0.2, 0.4 and 0.6) is multiplied by 100, prefixed with the letter 'v', and extracted into a separate column holding that key's array.

What would the code look like to achieve this? I have tried withColumn but couldn't get it to work.

Try the code below; the inline comments explain each step.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Load the JSON file
    val df = spark.read.json("src/main/resources/dyamicCol.json")
    // Temporary DataFrame holding only the nested "values" struct
    val dfTemp = df.select(col("inputs.values").as("values"))
    val index = dfTemp.schema.fieldIndex("values")
    // The fields of the struct are the dynamic keys ("0.2", "0.4", ...)
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    // Add one top-level column per nested field
    val dfFinal = propSchema.fields.foldLeft(df)((acc, field) => {
      val colNameInt = (field.name.toDouble * 100).toInt // "0.2" -> 20
      val colName = s"v$colNameInt"                      // 20 -> "v20"
      // Backticks keep the dot in the field name from being read as a path separator
      acc.withColumn(colName, col("inputs.values.`" + field.name + "`"))
    }).drop("inputs") // Drop the original nested column

    // Write the result as JSON
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json")
  }

}
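One caveat about the name computation above: `(x * 100).toInt` truncates, and most decimal fractions are not exactly representable as a Double. The three keys in the question happen to come out right, but 0.29 * 100 evaluates to 28.999999999999996 and truncates to 28. The following is a minimal sketch, assuming that rounding to the nearest integer is the intended behaviour; the object and method names are hypothetical and not part of the answer's code.

```scala
object RoundingSketch {
  // Truncating conversion, in the style of (field.name.toDouble * 100).toInt
  def truncated(key: String): Int = (key.toDouble * 100).toInt

  // Hypothetical alternative: round to the nearest integer before converting
  def rounded(key: String): Int = math.round(key.toDouble * 100).toInt

  def main(args: Array[String]): Unit = {
    // Prints both variants side by side for a few sample keys
    for (key <- Seq("0.2", "0.4", "0.6", "0.29"))
      println(s"$key -> truncated v${truncated(key)}, rounded v${rounded(key)}")
  }
}
```

If the keys can carry more than two decimal places, rounding silently merges distinct keys into one column name, so it may be worth checking the generated names for duplicates first.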

I would split the column-renaming logic into two cases: names that are numeric values, which get converted, and names that stay unchanged.

def stringDecimalToVNumber(colName:String): String =
  "v" + (colName.toFloat * 100).toInt.toString

and then combine both cases in a single function that picks the transformation by pattern match:

val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName:String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
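To see which names are affected, here is a small self-contained usage sketch. It restates the two functions above inside a hypothetical wrapper object (`ColumnNameSketch` is an illustrative name only) and relies on the fact that a Scala `Regex` used as a pattern must match the whole string, so names such as `id1` that merely contain a digit fall through to the second case.

```scala
object ColumnNameSketch {
  private val floatRegex = """(\d+\.?\d*)""".r

  def stringDecimalToVNumber(colName: String): String =
    "v" + (colName.toFloat * 100).toInt.toString

  def transformColumnName(colName: String): String = colName match {
    case floatRegex(v) => stringDecimalToVNumber(v) // whole name is numeric
    case other         => other                     // anything else is kept
  }

  def main(args: Array[String]): Unit =
    // Prints the old and new name for each sample column
    Seq("id", "id1", "0.2", "0.4", "0.6")
      .foreach(n => println(s"$n -> ${transformColumnName(n)}"))
}
```

Note that a top-level column whose name is purely numeric, such as `1`, would also match the pattern and be renamed (to `v100`), which may or may not be what you want.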

Now that we have a function to transform the column names, let's pick up the schema dynamically.

// Keep the top-level columns and expand every field of inputs.values;
// id1 is included because the desired output contains it
val flattenDF = df.select("id", "id1", "inputs.values.*")

val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })

This dynamically renames every column under inputs.values and places the renamed columns alongside the selected top-level columns.
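As a possible variation, the extraction and the renaming can be combined into a single `select` with aliased columns, which avoids adding or renaming columns one at a time inside a `foldLeft`. This is only a sketch under assumptions: it expects a DataFrame shaped like the one in the question (a struct column `inputs.values` whose field names are numeric strings), `SingleSelectSketch`, `vName` and `flattenValues` are hypothetical names, and the naming helper rounds rather than truncates.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object SingleSelectSketch {
  // Hypothetical helper: "0.2" -> "v20" (rounds instead of truncating)
  def vName(key: String): String = "v" + math.round(key.toDouble * 100)

  // Assumes df has a struct column inputs.values with numeric-string field names
  def flattenValues(df: DataFrame): DataFrame = {
    // Field names of the nested struct, e.g. Array("0.2", "0.4", "0.6")
    val nestedNames = df.select("inputs.values.*").columns
    // Every top-level column except the nested one
    val kept: Array[Column] =
      df.columns.filter(_ != "inputs").map(n => col(s"`$n`"))
    // One aliased column per nested field; backticks protect the dot in the name
    val extracted: Array[Column] =
      nestedNames.map(n => col(s"inputs.values.`$n`").as(vName(n)))
    df.select(kept ++ extracted: _*)
  }
}
```

Calling `SingleSelectSketch.flattenValues(df)` on the DataFrame from the question should yield the top-level columns followed by `v20`, `v40` and `v60`.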
