[英]Java Spark : Spark Bug Workaround for Datasets Joining with unknow Join Column Names
我正在Java中使用Spark 2.3.1。
我遇到了(我認為)是Spark的這個已知錯誤 。
這是我的代碼:
public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns){
Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
final Dataset<Row> join = df1.join(df2, columns_seq);
join.show()
join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
return join;
}
我這樣稱呼我的代碼:
Dataset<Row> myNewDF = compute(MyDataset1, MyDataset2, Arrays.asList("field1","field2","field3","field4"));
注意:MyDataset1和MyDataset2是來自同一數據集MyDataset0的兩個數據集,具有多個不同的轉換。
在join.show()
行中,出現以下錯誤:
2018-08-03 18:48:43 - ERROR main Logging$class - - - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
...
2018-08-03 18:48:47 - WARN main Logging$class - - - Whole-stage codegen disabled for plan (id=7):
但是它不會停止執行,仍會顯示數據集的內容。
然后,在join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
我得到錯誤:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'value2,'value1 missing from field6#16,field7#3,field8#108,field5#0,field9#4,field10#28,field11#323,value1#298,field12#131,day#52,field3#119,value2#22,field2#35,field1#43,field4#144 in operator 'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]. Attribute(s) with the same name appear in the operation: value2,value1. Please check if the right attribute(s) are used.;;
'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]
+- AnalysisBarrier
...
該錯誤結束程序。
在Jira Issue上建議的Mijung Kim解決方法是使用DF(Columns)創建數據集克隆。 但是在我的情況下,用於連接的列名是事先未知的(我只有一個列表),我無法使用此替代方法。
還有另一種方法可以解決這個非常煩人的錯誤嗎?
嘗試調用此方法:
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
List<Column> filterColumns = new ArrayList<>();
List<String> filterColumnsNames = new ArrayList<>();
scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
while (it.hasNext()) {
String columnName = it.next().name();
filterColumns.add(ds.col(columnName));
filterColumnsNames.add(columnName);
}
ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq()).toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
return ds;
}
在連接之前的兩個數據集上都是這樣的:
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
final Dataset<Row> join = df1.join(df2, columns_seq);
// or ( based on Nakeuh comment )
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.