
In Spark, how can I rename the column names of a DataFrame without reassignment?

I have a DataFrame named dataDF whose columns I want to rename. Another DataFrame, mapDF, holds an "original_name" -> "code_name" mapping. I want to change dataDF's column names from "original_name" to "code_name" wherever mapDF contains those values. I am currently re-assigning dataDF in a loop, but this yields low performance when the data size is huge, and it also loses parallelism. Can this be done in a better way, to achieve parallelism and good performance with a huge dataDF dataset?

import sparkSession.sqlContext.implicits._

var dataDF = Seq((10, 20, 30, 40, 50), (100, 200, 300, 400, 500), (10, 222, 333, 444, 555),
    (1123, 2123, 3123, 4123, 5123), (1321, 2321, 3321, 4321, 5321))
  .toDF("col_1", "col_2", "col_3", "col_4", "col_5")
dataDF.show(false)

val mapDF = Seq(("col_1", "code_1", "true"), ("col_3", "code_3", "true"), ("col_4", "code_4", "true"), ("col_5", "code_5", "true"))
  .toDF("original_name", "code_name", "important")
mapDF.show(false)

// Collect the original_name -> code_name pairs to the driver as a Map
val map_of_codename = mapDF.rdd.map(x => (x.getString(0), x.getString(1))).collectAsMap()

// Re-assign dataDF once per column; columns missing from the map are renamed to "None"
dataDF.columns.foreach(x => {
  if (map_of_codename.contains(x))
    dataDF = dataDF.withColumnRenamed(x, map_of_codename(x))
  else
    dataDF = dataDF.withColumnRenamed(x, "None")
})
dataDF.show(false)

========================
dataDF
+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|
+-----+-----+-----+-----+-----+
|10   |20   |30   |40   |50   |
|100  |200  |300  |400  |500  |
|10   |222  |333  |444  |555  |
|1123 |2123 |3123 |4123 |5123 |
|1321 |2321 |3321 |4321 |5321 |
+-----+-----+-----+-----+-----+

mapDF
+-------------+---------+---------+
|original_name|code_name|important|
+-------------+---------+---------+
|col_1        |code_1   |true     |
|col_3        |code_3   |true     |
|col_4        |code_4   |true     |
|col_5        |code_5   |true     |
+-------------+---------+---------+

expected DF:
+------+----+------+------+------+
|code_1|None|code_3|code_4|code_5|
+------+----+------+------+------+
|10    |20  |30    |40    |50    |
|100   |200 |300   |400   |500   |
|10    |222 |333   |444   |555   |
|1123  |2123|3123  |4123  |5123  |
|1321  |2321|3321  |4321  |5321  |
+------+----+------+------+------+

As an alternative, you can try to use aliases, like this:

// Build one alias per column so the whole rename becomes a single projection
val aliases = dataDF.columns.map(columnName => $"${columnName}".as(map_of_codename.getOrElse(columnName, "None")))
dataDF.select(aliases: _*).show()

dataDF.select(aliases: _*).explain(true)

The execution plan will then be composed of a single projection node, which may help to shorten the optimization phase:

== Analyzed Logical Plan ==
code_1: int, None: int, code_3: int, code_4: int, code_5: int
Project [col_1#16 AS code_1#77, col_2#17 AS None#78, col_3#18 AS code_3#79, col_4#19 AS code_4#80, col_5#20 AS code_5#81]
+- Project [_1#5 AS col_1#16, _2#6 AS col_2#17, _3#7 AS col_3#18, _4#8 AS col_4#19, _5#9 AS col_5#20]
   +- LocalRelation [_1#5, _2#6, _3#7, _4#8, _5#9]

That being said, I'm not sure it will solve the performance issue, because in both cases, your foreach and the proposal above, the physical plan can be optimized to a single node thanks to the CollapseProject rule.
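To verify this, you can compare the optimized plans of the two variants. A minimal sketch, assuming dataDF still has its original column names and that aliases and map_of_codename are defined as above; the foldLeft here is just a non-mutating equivalent of your foreach loop:

// After the optimizer applies CollapseProject, both variants should show
// a single Project node over the LocalRelation
println(dataDF.select(aliases: _*).queryExecution.optimizedPlan)

// The chained withColumnRenamed version collapses to the same shape
val renamedByLoop = dataDF.columns.foldLeft(dataDF)((df, c) =>
  df.withColumnRenamed(c, map_of_codename.getOrElse(c, "None")))
println(renamedByLoop.queryExecution.optimizedPlan)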

FYI, withColumnRenamed uses a similar approach under the hood, except that it does it for every column separately:

  def withColumnRenamed(existingName: String, newName: String): DataFrame = {
    val resolver = sparkSession.sessionState.analyzer.resolver
    val output = queryExecution.analyzed.output
    val shouldRename = output.exists(f => resolver(f.name, existingName))
    if (shouldRename) {
      val columns = output.map { col =>
        if (resolver(col.name, existingName)) {
          Column(col).as(newName)
        } else {
          Column(col)
        }
      }
      select(columns : _*)
    } else {
      toDF()
    }
  }
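Since each rename is just a projection over all columns, you can also express the whole renaming in one pass with the standard toDF(names: _*) method, which takes one new name per existing column. A minimal sketch, assuming map_of_codename from the question:

// Compute the target name for every column, in order, then rename them all at once
val newNames = dataDF.columns.map(c => map_of_codename.getOrElse(c, "None"))
val renamedDF = dataDF.toDF(newNames: _*)
renamedDF.show(false)

This produces the same single-projection plan as the alias version.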

Do you have any input on the observed performance issues? Some measurements that could help identify the operation taking time? Maybe it's not necessarily related to the column renaming? What are you doing later with these renamed columns?

One approach is to obtain the full column mapping list on the driver first, without Spark, and then use a for loop to rename all the columns instead of calling columns.foreach.

Here is an example of my solution (sorry, I'm not an expert in Scala, so some of the data parsing might be ugly):

var dataDF = Seq((10, 20, 30, 40, 50),(100, 200, 300, 400, 500),(10, 222, 333, 444, 555),(1123, 2123, 3123, 4123, 5123),(1321, 2321, 3321, 4321, 5321))
  .toDF("col_1", "col_2", "col_3", "col_4", "col_5")
dataDF.show(false)

val mapDF = Seq(("col_1", "code_1", "true"),("col_3", "code_3", "true"),("col_4", "code_4", "true"),("col_5", "code_5", "true"))
  .toDF("original_name", "code_name", "important")

val schema_mapping = mapDF.select("original_name", "code_name").collect()

// For the mapping of the "None" column (col_2): every column of dataDF
// that is absent from schema_mapping gets renamed to "None"
val old_schema = dataDF.columns
val none_mapping = old_schema
  .filterNot(c => schema_mapping.map(_.getString(0)).contains(c))
  .map(c => Array(c, "None"))


for (i <- 0 until schema_mapping.length) {
  try {
    dataDF = dataDF.withColumnRenamed(schema_mapping(i).getString(0), schema_mapping(i).getString(1))
  } catch {
    case e: Throwable => println("cannot rename " + schema_mapping(i).getString(0) + " to " + schema_mapping(i).getString(1))
  }
}

for (i <- 0 until none_mapping.length) {
  try {
    dataDF = dataDF.withColumnRenamed(none_mapping(i)(0), none_mapping(i)(1))
  } catch {
    case e: Throwable => println("cannot rename " + none_mapping(i)(0))
  }
}

dataDF.show(false)

In the Spark UI, each column rename becomes its own stage, but those stages should be executed in parallel, as can be seen in the DAG visualization.
