
Column comparison in Spark Scala

I have 2 dataframes like this.

scala> df1.show

+---+---------+
| ID|    Count|
+---+---------+
|  1|20.565656|
|  2|30.676776|
+---+---------+

scala> df2.show

+---+-----------+
| ID|      Count|
+---+-----------+
|  1|10.00998787|
|  2|    40.7767|
+---+-----------+

How can I take the max of the Count column after a join?

Expected output:

+---+---------+
| ID|    Count|
+---+---------+
|  1|20.565656|
|  2|  40.7767|
+---+---------+

After joining both dataframes, create a UDF that takes the two count columns as input and returns the greater of the two values.

  • Note that a UDF is only needed when no built-in function covers the logic; UDFs are opaque to Spark's Catalyst optimizer, so prefer built-ins (such as `greatest`) when one exists, and fall back to a UDF for custom multi-column derivations.
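A minimal sketch of the UDF approach, assuming the `df1`/`df2` pair from the question and a running SparkSession; `maxCount` is a hypothetical name for the UDF:

```scala
import org.apache.spark.sql.functions.{col, udf}

// UDF returning the greater of two Double columns
val maxCount = udf((x: Double, y: Double) => math.max(x, y))

df1.alias("a")
  .join(df2.alias("b"), Seq("ID"))
  .select(col("ID"), maxCount(col("a.Count"), col("b.Count")).alias("Count"))
  .show()
```

Joining on `Seq("ID")` keeps a single `ID` column in the output, so only the two `Count` columns need disambiguating aliases.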

You can do this:

df1.union(df2).groupBy("ID").max("Count").show()

+---+----------+
| ID|max(Count)|
+---+----------+
|  1| 20.565656|
|  2|   40.7767|
+---+----------+
scala> df.show()
+---+---------+
| ID|    Count|
+---+---------+
|  1|20.565656|
|  2|30.676776|
+---+---------+


scala> df1.show()
+---+-----------+
| ID|      Count|
+---+-----------+
|  1|10.00998787|
|  2|    40.7767|
+---+-----------+


scala> import org.apache.spark.sql.functions.{col, when}

scala> df.alias("x").join(df1.alias("y"), List("ID")).
     |     select(col("ID"), col("x.Count").alias("Xcount"), col("y.Count").alias("Ycount")).
     |     withColumn("Count", when(col("Xcount") >= col("Ycount"), col("Xcount")).otherwise(col("Ycount"))).
     |     drop("Xcount", "Ycount").
     |     show()
+---+---------+
| ID|    Count|
+---+---------+
|  1|20.565656|
|  2|  40.7767|
+---+---------+
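The intermediate `Xcount`/`Ycount` columns can be avoided entirely with Spark's built-in `greatest` function. A sketch against the same `df`/`df1` pair (note `greatest` skips nulls and returns null only if all inputs are null):

```scala
import org.apache.spark.sql.functions.{col, greatest}

// Built-in greatest() replaces the when/otherwise comparison
df.alias("x").join(df1.alias("y"), List("ID"))
  .select(col("ID"), greatest(col("x.Count"), col("y.Count")).alias("Count"))
  .show()
```

This keeps the comparison inside Catalyst-optimizable built-ins instead of a manual column-by-column `when` chain.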
