
Join Dataframes dynamically using Spark Scala when JOIN columns differ

Dynamically select multiple columns while joining different Dataframe in scala spark

From the above link, I was able to get the join expression working, but what if the column names are different? We cannot use Seq(columns) and need to build the join dynamically. Here left_ds and right_ds are the dataframes I want to join. Below I want to join on the columns id=acc_id and acc_no=number.

left_ds => id,acc_no,name,ph

right_ds => acc_id,number,location

val joinKeys="id,acc_id|acc_no,number"
val joinKeyPair: Array[(String, String)] = joinKeys.split("\\|").map(_.split(",")).map(x => x(0).toUpperCase -> x(1).toUpperCase)

val joinExpr: Column = joinKeyPair.map { case (ltable_col, rtable_col) => left_ds.col(ltable_col) === right_ds.col(rtable_col) }.reduce(_ and _)

left_ds.join(right_ds, joinExpr, "left_outer")

Above is the join expression I was trying, but it is not working. Is there a way to achieve this when the join column names differ, without using Seq? If the number of join keys increases, the code should still work dynamically.

With aliases it should work fine:

// Split "id,acc_id|acc_no,number" into pairs of (left column, right column)
val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
// Reference columns through the dataframe aliases so the left and right names can differ
val joinExpr = conditionArrays.map { case Array(a, b) => col("a." + a) === col("b." + b) }.reduce(_ and _)
left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer")

