
How to write dynamic join condition in Spark Java API

I want to perform a left outer join on Datasets using the Spark Java API. How can I write a dynamic condition that matches multiple columns in the join condition?

I have two Dataset objects, each with two or more columns, and I am not able to define the join condition.

Example that matches one column against another:

dataSet = resultData.as("resultData")
        .join(distinctData.as("distinctData"),
              resultData.col("A").equalTo(distinctData.col("B")),
              "leftouter")
        .selectExpr(select.toString());

Since there are multiple columns, I am not able to define a dynamic expression that matches all of them using the Java API.
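For example, with two join columns the condition would have to chain the equality tests, something like this (the column names "A" and "B" are placeholders):

Column condition = resultData.col("A").equalTo(distinctData.col("A"))
        .and(resultData.col("B").equalTo(distinctData.col("B")));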

Untested code, but this dynamically generates a join condition from a list of column names:

import java.util.LinkedList;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Recursively ANDs together one equality test per column name.
// Consumes the list as it goes, so callers should pass a mutable copy.
public Column makeJoinConditional(Dataset<Row> df1, Dataset<Row> df2,
                                  List<String> columnNames, Column c) {
    if (columnNames.isEmpty()) {
        return c; // every column consumed; c now holds the full condition
    }
    String top = columnNames.remove(0);
    Column equality = df1.col(top).equalTo(df2.col(top));
    // The first column starts the condition; later columns are ANDed on.
    Column next = (c == null) ? equality : c.and(equality);
    return makeJoinConditional(df1, df2, columnNames, next);
}

public Dataset<Row> joinDataFrames(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
    // Copy the list so the recursion does not destroy the caller's copy
    // (and so immutable lists work too).
    Column joinCols = makeJoinConditional(df1, df2, new LinkedList<>(columns), null);
    // "leftouter" gives the left outer join the question asks for.
    return df1.join(df2, joinCols, "leftouter");
}
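A minimal sketch of how this might be called with the datasets from the question; the column names "A" and "B" are placeholders:

List<String> joinColumns = java.util.Arrays.asList("A", "B");
Dataset<Row> joined = joinDataFrames(resultData, distinctData, joinColumns);
joined.show();

When the join columns have the same names on both sides, Spark's built-in join(Dataset, Seq&lt;String&gt;, String) overload is an alternative that also avoids duplicating the join columns in the output.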
