Spark: How to merge two similar columns from two DataFrames into one column by doing a join?
I have a SQL table that I have to update using data from the table.
For this purpose, I compute a DataFrame.
I have two DataFrames: one that I compute and one that I load from the database.
val myDF = spark.read.<todo something>.load()
val dbDF = spark.read.format("jdbc").<...>.load()
In the end, both DataFrames have the same structure.
For example:
myDF

key | column |
---|---|
key1 | 1 |
key2 | 2 |
key3 | 3 |

dbDF

key | column |
---|---|
key1 | 5 |
key2 | 5 |
key3 | 5 |

I need to get a new DataFrame that has only one column named column:
newDF

key | column |
---|---|
key1 | 6 |
key2 | 7 |
key3 | 8 |

For this purpose, I do the following:
myDF
.as("left")
.join(dbDF.as("right"), "key")
.withColumn("column_temp", $"left.column" + $"right.column")
.drop($"left.column")
.drop($"right.column")
.withColumnRenamed("column_temp", "column")
I have to repeat these steps for each column that I need to calculate.
In other words, my joins are not meant to add new columns. I have to merge similar columns into one column.
I can calculate the new column by summing the two columns, or I can just pick the non-null value from the two given columns, like this:
myDF
.as("left")
.join(dbDF.as("right"), "key")
.withColumn("column_temp", coalesce($"left.column", $"right.column"))
.drop($"left.column")
.drop($"right.column")
.withColumnRenamed("column_temp", "column")
When my DataFrames have many columns and only one or two key columns, I have to repeat the steps above for each column.
My question is:
Is there a more efficient way to do this? Or am I doing it right?
myDF.join(dbDF, myDF.col("key").equalTo(dbDF.col("key")))
.select(myDF.col("key"), myDF.col("column").plus(dbDF.col("column")).as("column"));
Can you try this? It is an inner join, so only the rows in the left table that have a match in the right table are selected. Is that your case?
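For the many-column case described in the question, one possible way to avoid repeating the withColumn/drop/rename steps is to build all merged columns in a single select. The following is only a minimal sketch: `mergeByKey` is a hypothetical helper name (not part of Spark), and it assumes that both DataFrames share the same non-key column names and that every non-key column can be combined with the same merge function.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MergeColumns {
  // Hypothetical helper: joins `left` and `right` on `keys`, then merges each
  // non-key column of `left` with the same-named column of `right`.
  // Assumes both DataFrames have identical non-key column names.
  def mergeByKey(left: DataFrame,
                 right: DataFrame,
                 keys: Seq[String],
                 joinType: String = "inner")
                (merge: (Column, Column) => Column): DataFrame = {
    // Every column that is not a join key gets merged.
    val valueCols = left.columns.filterNot(c => keys.contains(c))
    // One expression per value column, named after the original column.
    val merged = valueCols.map(c => merge(col(s"l.$c"), col(s"r.$c")).as(c))
    left.as("l")
      .join(right.as("r"), keys, joinType) // using-join keeps a single copy of each key
      .select(keys.map(k => col(k)) ++ merged: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("merge-columns").getOrCreate()
    import spark.implicits._

    // Sample data taken from the tables in the question.
    val myDF = Seq(("key1", 1), ("key2", 2), ("key3", 3)).toDF("key", "column")
    val dbDF = Seq(("key1", 5), ("key2", 5), ("key3", 5)).toDF("key", "column")

    // Sum the matching columns; with this data the sums are 6, 7 and 8.
    mergeByKey(myDF, dbDF, Seq("key"))(_ + _).orderBy("key").show()

    spark.stop()
  }
}
```

Passing `coalesce` from `org.apache.spark.sql.functions` as the merge function, for example `mergeByKey(myDF, dbDF, Seq("key"))((l, r) => coalesce(l, r))`, would give the "pick the non-null value" variant. If keys that exist on only one side must be kept, a join type such as "full_outer" together with a null-safe merge function might be used instead of the default inner join.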