
Java Spark: Spark Bug Workaround for Datasets Joining with Unknown Join Column Names

I am using Spark 2.3.1 with Java.

I have encountered what I think is this known bug of Spark.

Here is my code:

public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns){
    // Convert the Java list of join column names to a Scala Seq.
    Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();

    final Dataset<Row> join = df1.join(df2, columns_seq);

    join.show();

    join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();

    return join;
}

I call my code like this:

Dataset<Row> myNewDF = compute(MyDataset1, MyDataset2, Arrays.asList("field1","field2","field3","field4"));

Note: MyDataset1 and MyDataset2 are two datasets derived from the same dataset, MyDataset0, through different chains of transformations.
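For context, the lineage looks roughly like this (a hypothetical sketch: the source, the transformations, and the use of field5/field6/field7 are made up for illustration; the only point is that both join inputs descend from the same parent):

import static org.apache.spark.sql.functions.*;

// Hypothetical sketch: both join inputs come from the same parent dataset,
// so their columns carry the same internal attribute references.
Dataset<Row> MyDataset0 = spark.read().parquet("/some/path"); // 'spark' is an assumed SparkSession
Dataset<Row> MyDataset1 = MyDataset0
        .filter(col("field5").isNotNull())
        .withColumn("value1", col("field6").multiply(2));
Dataset<Row> MyDataset2 = MyDataset0
        .groupBy("field1", "field2", "field3", "field4")
        .agg(sum("field7").as("value2"));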

On the join.show() line, I get the following error:

2018-08-03 18:48:43 - ERROR main Logging$class -  -  - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
    at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
    at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
    at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
    at org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
    at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
    ...

2018-08-03 18:48:47 - WARN main Logging$class -  -  - Whole-stage codegen disabled for plan (id=7):

But it does not stop the execution and still displays the content of the dataset.

Then, on the line join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();

I get the error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'value2,'value1 missing from field6#16,field7#3,field8#108,field5#0,field9#4,field10#28,field11#323,value1#298,field12#131,day#52,field3#119,value2#22,field2#35,field1#43,field4#144 in operator 'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]. Attribute(s) with the same name appear in the operation: value2,value1. Please check if the right attribute(s) are used.;;
'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]
+- AnalysisBarrier
...

This error ends the program.

The workaround proposed by Mijung Kim on the Jira issue is to create a clone of the Dataset with toDF(columns). But in my case, the column names used for the join are not known in advance (I only have a List), so I can't use this workaround.
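For reference, that suggestion amounts to something like the following when the column names are known in advance (a hypothetical sketch based on my reading of the ticket, not code taken from it):

// Hypothetical sketch of the Jira workaround: toDF() re-aliases every column,
// which forces Spark to mint fresh attribute IDs for the clone.
Dataset<Row> df1Clone = df1.toDF("field1", "field2", "field3", "field4", "value1");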

Is there another way to get around this very annoying bug?

Try calling this method:

private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
    // Re-select every column and re-apply the column names with toDF().
    // The aliasing makes Spark generate fresh attribute references, which
    // removes the ambiguity between the two sides of the join.
    List<Column> filterColumns = new ArrayList<>();
    List<String> filterColumnsNames = new ArrayList<>();
    for (String columnName : ds.columns()) {
        filterColumns.add(ds.col(columnName));
        filterColumnsNames.add(columnName);
    }
    return ds.select(filterColumns.toArray(new Column[0]))
             .toDF(filterColumnsNames.toArray(new String[0]));
}

on both datasets just before the join, like this:

df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
final Dataset<Row> join = df1.join(df2, columns_seq);
// or, based on Nakeuh's comment, clone the join result instead:
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
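Putting it together, the compute method from the question becomes the following (a sketch that clones both inputs; if the withColumn step still fails, fall back to cloning the join result as in the second variant above):

public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
    Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();

    // Clone both inputs so the two sides of the join no longer share attribute references.
    final Dataset<Row> join = cloneDataset(df1).join(cloneDataset(df2), columns_seq);

    join.show();
    join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
    return join;
}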
