
Re-assigning a value to a column doesn't work; it creates a new column that cannot be selected

I have written a Scala function to join two DataFrames with the same schema, say df1 and df2. For every key in df1, if the key also exists in df2, we take the value from df2 for that key; otherwise we keep df1's value. The function is supposed to return a DataFrame with the same number of rows as df1 but with updated values, yet it doesn't work and returns the same DataFrame as df1.

def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]) = {
  var final_df = df1.as("a").join(df2.as("b"), key_seq, "left_outer")
  // set of non-key columns
  val col_str = df1.columns.toSet -- key_seq.toSet
  for (c <- col_str) { // for every matched record, check values from both dataframes
    final_df = final_df
      .withColumn(s"$c",
        when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c")))
    // I used to re-assign the value with the reference "t.$c",
    // but that returns an error saying no t.col found in schema
  }
  final_df.show()

  final_df.select(df1.columns.map(x => df1(x)): _*)
}


def main(args: Array[String]) {
  val sparkSession = SparkSession.builder().appName(this.getClass.getName)
    .config("spark.hadoop.validateOutputSpecs", "false")
    .enableHiveSupport()
    .getOrCreate()
  import sparkSession.implicits._

  val df1 = List(("key1", 1), ("key2", 2), ("key3", 3)).toDF("x", "y")

  val df2 = List(("key1", 9), ("key2", 8)).toDF("x", "y")

  joinDFwithConditions(df1, df2, Seq("x")).show()

  sparkSession.stop()
}

df1 sample

+----+---+
|   x|  y|
+----+---+
|key1|  1|
|key2|  2|
|key3|  3|
+----+---+

df2 sample

+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
+----+---+

expected results:

+----+---+
|   x|  y|
+----+---+
|key1|  9|
|key2|  8|
|key3|  3|
+----+---+

what really shows:

+-------+---+---+
|  x    |  y|  y|
+-------+---+---+
|  key1 |  9|  9|
|  key2 |  8|  8|
|  key3 |  3|  3|
+-------+---+---+

error message:

ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Resolved attribute(s) y#6 missing from x#5,y#21,y#22 in operator !Project [x#5, y#6]. Attribute(s) with the same name appear in the operation: y. Please check if the right attribute(s) are used.;;
!Project [x#5, y#6]
+- Project [x#5, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#21, CASE WHEN (isnull(y#15) || isnan(cast(y#15 as double))) THEN y#6 ELSE y#15 END AS y#22]
   +- Project [x#5, y#6, y#15]
      +- Join LeftOuter, (x#5 = x#14)
         :- SubqueryAlias `a`
         :  +- Project [_1#2 AS x#5, _2#3 AS y#6]
         :     +- LocalRelation [_1#2, _2#3]
         +- SubqueryAlias `b`
            +- Project [_1#11 AS x#14, _2#12 AS y#15]
               +- LocalRelation [_1#11, _2#12]

When you do df.as("a"), you do not rename the columns of the dataframe. You simply allow them to be accessed as a.columnName in order to lift an ambiguity. Your when clause therefore works fine because you use the aliases, but you end up with multiple y columns. I am quite surprised, by the way, that it manages to replace one of the y columns...

When you then try to access a column by its bare name y (without a prefix), Spark does not know which one you want and throws an error.
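To make the ambiguity concrete, here is a minimal sketch (assuming a live SparkSession and the df1/df2 from the question): the aliased references resolve, but the bare column name does not.

```scala
val joined = df1.as("a").join(df2.as("b"), Seq("x"), "left_outer")

// Fine: the aliases disambiguate the two `y` columns.
joined.select(col("a.y"), col("b.y")).show()

// Throws AnalysisException: Reference 'y' is ambiguous,
// because both a.y and b.y are still present in the schema.
// joined.select("y").show()
```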

To avoid errors, you could simply do everything you need with one select like this:

df1.as("a").join(df2.as("b"), key_cols, "left_outer")
    .select(key_cols.map(col) ++
        df1
            .columns
            .diff(key_cols)
            .map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
                   .otherwise(col(s"b.$c"))
                   .alias(c)
            ) : _*)
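Plugged back into the question's function signature, the corrected version might look like this (a sketch; key_seq here plays the role of key_cols above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

def joinDFwithConditions(df1: DataFrame, df2: DataFrame, key_seq: Seq[String]): DataFrame =
  df1.as("a").join(df2.as("b"), key_seq, "left_outer")
    .select(key_seq.map(col) ++
      df1.columns
        .diff(key_seq)
        // take b's value when it exists, fall back to a's value otherwise
        .map(c => when(col(s"b.$c").isNull || col(s"b.$c").isNaN, col(s"a.$c"))
          .otherwise(col(s"b.$c"))
          .alias(c)): _*)
```

Because each conditional column is aliased back to its original name inside a single select, the result has exactly one y column and can be selected normally afterwards.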
