I have a SQL table that I have to update using data from another table. For this purpose, I compute a DataFrame. So I have two DataFrames: the one I compute and the one I read from the database.
val myDF = spark.read.<todo something>.load()
val dbDF = spark.read.format("jdbc").<...>.load()
In the end, both DataFrames have the same structure.
For example:
myDF
| key | column |
| --- | --- |
| key1 | 1 |
| key2 | 2 |
| key3 | 3 |
dbDF
| key | column |
| --- | --- |
| key1 | 5 |
| key2 | 5 |
| key3 | 5 |
I need to get a new DF that still has a single value column named column; in this example it holds the sum of the two inputs.
newDF
| key | column |
| --- | --- |
| key1 | 6 |
| key2 | 7 |
| key3 | 8 |
To achieve this, I perform the following steps:
myDF
  .as("left")
  .join(dbDF.as("right"), "key")
  .withColumn("column_temp", $"left.column" + $"right.column")
  .drop($"left.column")
  .drop($"right.column")
  .withColumnRenamed("column_temp", "column")
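As a side note (a sketch, using sample data built to match the tables above): the four steps — temporary column, two drops, rename — can usually be collapsed into a single select that computes the merged column directly.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("merge-sketch").getOrCreate()
import spark.implicits._

// Sample data matching the example tables above
val myDF = Seq(("key1", 1), ("key2", 2), ("key3", 3)).toDF("key", "column")
val dbDF = Seq(("key1", 5), ("key2", 5), ("key3", 5)).toDF("key", "column")

// One select instead of withColumn + drop + drop + withColumnRenamed:
// the join on "key" keeps a single key column, and the select keeps
// only the merged value column under its final name.
val newDF = myDF.as("left")
  .join(dbDF.as("right"), "key")
  .select($"key", ($"left.column" + $"right.column").as("column"))

newDF.show()
```

This avoids the intermediate column entirely; the same shape works for coalesce or any other per-pair expression.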
I have to repeat these steps for every column I need to recalculate. In other words, my joins are not supposed to add new columns; I have to merge each pair of same-named columns into a single column. I can compute the new column as the sum of the two, or I can just pick the first non-null of the two, like this:
import org.apache.spark.sql.functions.coalesce

myDF
  .as("left")
  .join(dbDF.as("right"), "key")
  .withColumn("column_temp", coalesce($"left.column", $"right.column"))
  .drop($"left.column")
  .drop($"right.column")
  .withColumnRenamed("column_temp", "column")
And since my DataFrames have many value columns and only one or two key columns, I have to repeat the steps above for each of them.
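One way to avoid repeating the steps per column (a sketch, not the only option): keep the key columns fixed and build the select list programmatically from the shared column names, with the merge function (sum, coalesce, ...) pluggable per column. The `merge` helper and the sample data below are illustrative, not part of the original question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("merge-all").getOrCreate()
import spark.implicits._

// Sample data matching the example tables above
val myDF = Seq(("key1", 1), ("key2", 2), ("key3", 3)).toDF("key", "column")
val dbDF = Seq(("key1", 5), ("key2", 5), ("key3", 5)).toDF("key", "column")

val keyCols   = Seq("key")
val valueCols = myDF.columns.filterNot(keyCols.contains)

// Merge function applied to each pair of same-named columns;
// swap in coalesce(l, r) to prefer the non-null value instead of summing.
def merge(l: Column, r: Column): Column = l + r

// Join once, then merge every value column in a single select.
val merged = myDF.as("l")
  .join(dbDF.as("r"), keyCols)
  .select(
    keyCols.map(col) ++
    valueCols.map(c => merge(col(s"l.$c"), col(s"r.$c")).as(c)): _*
  )
```

With this shape, adding or removing value columns in the source tables requires no code changes, only the key-column list.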
My question is: is there a more effective way to do this, or am I already doing it right?
myDF.join(dbDF, myDF.col("key").equalTo(dbDF.col("key")))
    .select(myDF.col("key"), myDF.col("column").plus(dbDF.col("column")).alias("column"));
Can you try this? It is an inner join, so only the rows in the left table that have a match in the right are selected. Is that your case?