Subtract columns in two dataframes to get differences in Spark Scala
Given two dataframes with the same columns, I would like to create a resulting dataframe containing the difference between the columns, bearing in mind that the dataframes have many columns (and rows).

I guess the approach is to first do an inner join and then call `withColumn` with a subtraction inside, but I don't know how to do this in an automated way for many columns.
Example:

First dataframe:
| Id | col1 | col2 | col3 | ... | colXX |
|---|---|---|---|---|---|
| 1 | 1.1 | 1.2 | 1.6 | ... | 1.8 |
Second dataframe:
| Id | col1 | col2 | col3 | ... | colXX |
|---|---|---|---|---|---|
| 1 | 1.2 | 1.2 | 2.1 | ... | 2.1 |
Expected dataframe:
| Id | diff_col1 | diff_col2 | diff_col3 | ... | diff_colXX |
|---|---|---|---|---|---|
| 1 | 0.1 | 0.0 | 0.5 | ... | 0.3 |
Thanks beforehand!
First prepare the selection of the difference expressions, then apply it to the dataframe resulting from the join:
```scala
import org.apache.spark.sql.functions.col

// Build one difference expression per column, skipping the join key
val selection =
  df1.columns.diff(Seq("Id"))
    .map(x => (col(s"df1.$x") - col(s"df2.$x")) as s"diff_$x")

// Join on Id and select the key plus all the difference columns
val query =
  df1.as("df1")
    .join(df2.as("df2"), Seq("Id"), "inner")
    .select((Seq(col("df1.Id")) ++ selection): _*)
```
Notice the aliases put on the dataframes in the join; they must match the names used in the difference expressions (`df1` and `df2`).
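The heart of the trick is purely the name-driven construction of the selection list from `df1.columns`. Stripped of Spark, the same pattern can be sketched in plain Scala; the column names and the `diff_` prefix below are just the ones from the example above, and the generated strings stand in for the `Column` expressions:

```scala
// Column names as they would come back from df1.columns
val columns = Seq("Id", "col1", "col2", "col3", "colXX")

// Drop the join key, then derive one SQL-like difference expression per
// remaining column, mirroring (col(s"df1.$x") - col(s"df2.$x")) as s"diff_$x"
val selection = columns.diff(Seq("Id"))
  .map(x => s"(df1.$x - df2.$x) AS diff_$x")

selection.foreach(println)
```

Because the list is derived from `df1.columns` rather than written out by hand, the same code works unchanged however many `colXX` columns the dataframes have.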