Subtract columns in two dataframes to get differences in Spark Scala

Question

Given two dataframes with the same columns, I would like to create a resulting dataframe that holds the difference between corresponding columns, keeping in mind that the dataframes have a lot of columns (and rows).

I guess the approach is to do an inner join first and then a "withColumn" with a subtraction inside, but I don't know how to do this in an automated way for a lot of columns.

Example:

First dataframe:

Id	col1	col2	col3	...	colXX
1	1.1	1.2	1.6	...	1.8

Second dataframe:

Id	col1	col2	col3	...	colXX
1	1.2	1.2	2.1	...	2.1

Expected dataframe:

Id	diff_col1	diff_col2	diff_col3	...	diff_colXX
1	0.1	0.0	0.5	...	0.3

Thanks in advance!

Answer 1

First build the list of difference expressions, then apply it as the selection on the dataframe that results from the join.

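import org.apache.spark.sql.functions.col

// One difference expression per non-key column, named diff_<column>
val selection =
    df1.columns.diff(Seq("Id"))
       .map(x => (col(s"df1.$x") - col(s"df2.$x")) as s"diff_$x")

// Join on Id under the aliases "df1" and "df2", then keep Id plus the differences
val query =
    df1.as("df1")
       .join(df2.as("df2"), Seq("Id"), "inner")
       .select((Seq(col("df1.Id")) ++ selection):_*)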
Notice that the aliases given to the dataframes in the join ("df1" and "df2") match the prefixes used in the difference expressions.
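
To try the approach end to end, here is a minimal sketch that wraps it in a helper and runs it on the sample rows from the question. The helper name `diffColumns`, the aliases `l` and `r`, and the local `SparkSession` settings are assumptions made for illustration, not part of the original answer. Since the expected values in the question are all positive, this sketch wraps the subtraction in `abs`; drop it if signed differences are wanted. It is written in script style, so it can be pasted into spark-shell (which already provides `spark`).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{abs, col}

// Hypothetical helper: joins two dataframes on `key` and returns the key
// plus one diff_<name> column for every other column of `left`.
def diffColumns(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  // Every column except the join key gets a difference expression.
  val diffs = left.columns
    .filterNot(_ == key)
    .map(c => abs(col(s"r.$c") - col(s"l.$c")).as(s"diff_$c"))

  // The aliases "l" and "r" must match the prefixes used in `diffs`.
  left.as("l")
    .join(right.as("r"), Seq(key), "inner")
    .select((col(s"l.$key") +: diffs): _*)
}

// Assumed local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("diff-columns-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows taken from the question (first three value columns only).
val df1 = Seq((1, 1.1, 1.2, 1.6)).toDF("Id", "col1", "col2", "col3")
val df2 = Seq((1, 1.2, 1.2, 2.1)).toDF("Id", "col1", "col2", "col3")

val result = diffColumns(df1, df2, "Id")
result.show()
```

Because the values are doubles, the differences may carry small floating-point error (for example a value very close to, but not exactly, 0.1). Building all expressions up front and issuing a single `select` also keeps the query plan flat, which tends to scale better to many columns than chaining one `withColumn` call per column.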
