Spark: How to merge two similar columns from two DataFrames into one column with a join?

Question

I have a SQL table that I need to update by combining new data with the data already in the table.

For this purpose, I calculate a DataFrame.

I have two DataFrames: one that I calculate and one that I read from the database.

val myDF = spark.read.<todo something>.load()

val dbDF = spark.read.format("jdbc").<...>.load()

In the end, both DataFrames have the same structure.

For example:

myDF

key	column
key1	1
key2	2
key3	3

dbDF

key	column
key1	5
key2	5
key3	5

I need to get a new DataFrame that has only one column named column, holding the combined values.

newDF

key	column
key1	6
key2	7
key3	8

To do this, I perform the following steps:

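// Requires `import spark.implicits._` for the $"..." syntax.
// Alias both sides so the same-named columns can be told apart after the join.
myDF
  .as("left")
  .join(dbDF.as("right"), "key")
  .withColumn("column_temp", $"left.column" + $"right.column")
  .drop($"left.column")
  // Pass a Column here: drop("right.column") would look for a column literally named "right.column".
  .drop($"right.column")
  .withColumnRenamed("column_temp", "column")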

I have to repeat these steps for every column I need to compute.

In other words, my joins are not meant to add new columns. I have to merge similar columns into one column.

I can compute the new column by summing the two columns, or I can simply pick the non-null value from the two given columns, like this:

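// Requires `import spark.implicits._` and `import org.apache.spark.sql.functions.coalesce`.
myDF
  .as("left")
  .join(dbDF.as("right"), "key")
  // coalesce returns the first non-null value among its arguments.
  .withColumn("column_temp", coalesce($"left.column", $"right.column"))
  .drop($"left.column")
  // Pass a Column here: drop("right.column") would look for a column literally named "right.column".
  .drop($"right.column")
  .withColumnRenamed("column_temp", "column")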

And when my DataFrames have many columns and only one or two key columns, I have to repeat the steps above for each column.

My question is:

Is there a more efficient way to do this? Or am I doing it right?

Answer 1

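    // Inner join on key; build the merged column in the same select,
    // while the columns of both inputs are still available.
    myDF.join(dbDF, myDF.col("key").equalTo(dbDF.col("key")))
            .select(myDF.col("key"),
                    myDF.col("column").plus(dbDF.col("column")).as("column"));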

Can you try this? It is an inner join, so only rows in the left table that have a match in the right table are selected. Is that your case?
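To avoid repeating the withColumn/drop/rename steps for every column, one option is to build the whole projection in a single select. The following is a minimal sketch, assuming both DataFrames have the same column names, that no column name contains a dot, and that every non-key column is combined the same way. The names `MergeColumns`, `mergeColumns`, `keys`, `joinType`, and `combine` are hypothetical and introduced only for illustration, as is the `local[*]` session used in `main`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object MergeColumns {
  // Joins `left` and `right` on `keys` and merges every other column of `left`
  // with the same-named column of `right` using `combine`.
  def mergeColumns(
      left: DataFrame,
      right: DataFrame,
      keys: Seq[String],
      joinType: String = "inner"
  )(combine: (Column, Column) => Column): DataFrame = {
    // Columns to merge: everything in `left` that is not a key.
    val valueCols = left.columns.filterNot(c => keys.contains(c)).toSeq
    // One output column per value column, named like the original.
    val merged = valueCols.map(c => combine(col(s"left.$c"), col(s"right.$c")).as(c))
    // Joining on a Seq of column names keeps a single copy of each key column.
    left.as("left")
      .join(right.as("right"), keys, joinType)
      .select(keys.map(k => col(k)) ++ merged: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-columns").getOrCreate()
    import spark.implicits._

    // Example data copied from the question.
    val myDF = Seq(("key1", 1), ("key2", 2), ("key3", 3)).toDF("key", "column")
    val dbDF = Seq(("key1", 5), ("key2", 5), ("key3", 5)).toDF("key", "column")

    // Sum the matching columns (the newDF from the question).
    mergeColumns(myDF, dbDF, Seq("key"))((a, b) => a + b).show()

    // Keep the first non-null value; a full outer join also keeps unmatched keys.
    mergeColumns(myDF, dbDF, Seq("key"), "full_outer")((a, b) => coalesce(a, b)).show()

    spark.stop()
  }
}
```

Because all merged columns are produced in one select, the number of columns no longer changes the amount of code; only the `combine` function and the join type need to be chosen. If different columns need different rules, the same idea should still work with a map from column name to combining function instead of a single `combine`.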

Answer 1
0 2021-11-18 19:20:19