How to write a dynamic join condition in the Spark Java API
I want to perform a left outer join on Datasets using the Spark Java API. How do I write a dynamic condition that matches multiple columns in the join condition?
I have two Dataset objects, each with two or more columns, and I am not able to define the join condition.
Example that matches one column against another:
dataSet = resultData.as("resultData")
        .join(distinctData.as("distinctData"),
              resultData.col("A").equalTo(distinctData.col("B")),
              "leftouter")
        .selectExpr(select.toString());
Since there are multiple columns, I am not able to define a dynamic expression that matches all of them using the Java API.
Untested code, but this dynamically generates a join condition from a list of column names:
public Column makeJoinConditional(Dataset<Row> df1, Dataset<Row> df2, List<String> columnNames, Column c) {
    // No columns left: return whatever has been accumulated (null if the list was empty).
    if (columnNames.isEmpty()) {
        return c;
    }
    String top = columnNames.get(0);
    Column eq = df1.col(top).equalTo(df2.col(top));
    Column next = (c == null) ? eq : c.and(eq);
    // Recurse on the rest of the list via a subList view, without mutating the caller's list.
    return makeJoinConditional(df1, df2, columnNames.subList(1, columnNames.size()), next);
}

public Dataset<Row> joinDataFrames(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
    Column joinCols = makeJoinConditional(df1, df2, columns, null);
    // Pass "leftouter" to match the left outer join asked for in the question.
    return df1.join(df2, joinCols, "leftouter");
}