How to join two Spark datasets into one with Java objects?
I have a little problem joining two datasets in Spark. I have this:
SparkConf conf = new SparkConf()
.setAppName("MyFunnyApp")
.setMaster("local[*]");
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.debug.maxToStringFields", 150)
.getOrCreate();
//...
//Do stuff
//...
Encoder<MyOwnObject1> encoderObject1 = Encoders.bean(MyOwnObject1.class);
Encoder<MyOwnObject2> encoderObject2 = Encoders.bean(MyOwnObject2.class);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile1)
.as(encoderObject1);
Dataset<MyOwnObject2> object2DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile2)
.as(encoderObject2);
I can print the schema and show it correctly.
//Here the problem starts
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS =
object1DS.join(object2DS, object1DS.col("column01")
.equalTo(object2DS.col("column01")))
.as(Encoders.tuple(encoderObject1, encoderObject2));
The last line can't perform the join and gives me this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<"LIST WITH ALL VARS FROM TWO OBJECT"> to Tuple2, but failed as the number of fields does not line up.;
That's true, because Tuple2 (object2) doesn't have all the vars...
Then I tried this:
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS = object1DS
.joinWith(object2DS, object1DS
.col("column01")
.equalTo(object2DS.col("column01")));
And it works fine! But I need a new Dataset without the tuple: I have an object3 that has some vars from object1 and object2, and then I hit this problem:
Encoder<MyOwnObject3> encoderObject3 = Encoders.bean(MyOwnObject3.class);
Dataset<MyOwnObject3> object3DS = joinObjectDS.map(tupleObject1Object2 -> {
MyOwnObject1 myOwnObject1 = tupleObject1Object2._1();
MyOwnObject2 myOwnObject2 = tupleObject1Object2._2();
MyOwnObject3 myOwnObject3 = new MyOwnObject3(); //Sets all vars with start values
//...
//Sets data from object 1 and 2 to 3.
//...
return myOwnObject3;
}, encoderObject3);
It fails! Here is the error:
17/05/10 12:17:43 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 593, Column 72: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
...and over a thousand more error lines.
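For reference (this was not the root cause in my case, as it turned out below), Encoders.bean can only generate working code for classes that follow the JavaBean convention: a public no-arg constructor plus a public getter and setter for every field. A minimal sketch of that shape, using a hypothetical stand-in for MyOwnObject3 with made-up fields, checked with the same java.beans introspection the bean encoder relies on:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class BeanCheck {
    // Hypothetical stand-in for MyOwnObject3: the shape Encoders.bean expects.
    public static class MyOwnObject3 {
        private String column01;
        private int total;

        public MyOwnObject3() {}  // public no-arg constructor is required

        public String getColumn01() { return column01; }
        public void setColumn01(String column01) { this.column01 = column01; }
        public int getTotal() { return total; }
        public void setTotal(int total) { this.total = total; }
    }

    public static void main(String[] args) throws Exception {
        // Every field must show up as a readable and writable property pair.
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(MyOwnObject3.class, Object.class)
                            .getPropertyDescriptors()) {
            System.out.println(pd.getName()
                    + " readable=" + (pd.getReadMethod() != null)
                    + " writable=" + (pd.getWriteMethod() != null));
        }
    }
}
```

In Java it can also help to cast the lambda explicitly, e.g. `joinObjectDS.map((MapFunction<Tuple2<MyOwnObject1, MyOwnObject2>, MyOwnObject3>) t -> { ... }, encoderObject3)`, so the compiler picks the `MapFunction` overload of `map` instead of the Scala one.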
What can I do? I want to use Datasets, to get the speed of DataFrames together with the object syntax of JavaRDDs...
Help?
Thanks
Finally I found a solution. I had a problem with the inferSchema option when my code was creating the Dataset: I have a String column that inferSchema turned into an Integer column because all its values are numeric, but I need to use them as Strings (like "0001", "0002"...). So I needed to define a schema; since I have many vars, I wrote this for all my classes:
List<StructField> fieldsObject1 = new ArrayList<>();
for (Field field : MyOwnObject1.class.getDeclaredFields()) {
fieldsObject1.add(DataTypes.createStructField(
field.getName(),
CatalystSqlParser.parseDataType(field.getType().getSimpleName()),
true)
);
}
StructType schemaObject1 = DataTypes.createStructType(fieldsObject1);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(schemaObject1)
.csv(pathToFile1)
.as(encoderObject1);
Works fine.
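This works because field.getType().getSimpleName() yields names like "String", "int" or "double", which CatalystSqlParser.parseDataType happens to accept as SQL type names (the parser is case-insensitive). A standalone sketch, with a hypothetical Sample bean, of the strings the loop above feeds to the parser:

```java
import java.lang.reflect.Field;

public class TypeNames {
    // Hypothetical stand-in for MyOwnObject1; only the field types matter here.
    public static class Sample {
        private String code;
        private int amount;
        private double price;
    }

    public static void main(String[] args) {
        // These simple names ("String", "int", "double") double as
        // Spark SQL type names, which is what the schema loop exploits.
        for (Field f : Sample.class.getDeclaredFields()) {
            System.out.println(f.getName() + " : " + f.getType().getSimpleName());
        }
    }
}
```

Note the trick breaks down for types whose simple name is not also a SQL type name (e.g. BigDecimal, or a nested custom class), so it is a convenience for flat beans of primitives and Strings rather than a general mapping.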
The "best" solution would be this:
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(encoderObject1.schema())
.csv(pathToFile1)
.as(encoderObject1);
but encoderObject1.schema() returns a schema with the vars in alphabetical order, not in their original order, so this option fails when reading the CSV. Maybe Encoders should return a schema with the vars in their original order instead of alphabetical order.
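The alphabetical order is not random: the bean encoder appears to build its schema via java.beans.Introspector, which sorts properties by name, whereas Class.getDeclaredFields usually returns fields in declaration order (the JVM spec does not guarantee that, but HotSpot does it in practice). A small sketch of the mismatch, using a hypothetical Demo bean whose fields are deliberately out of alphabetical order:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class FieldOrder {
    // Fields declared in non-alphabetical order on purpose.
    public static class Demo {
        private String zebra;
        private String apple;

        public String getZebra() { return zebra; }
        public void setZebra(String z) { this.zebra = z; }
        public String getApple() { return apple; }
        public void setApple(String a) { this.apple = a; }
    }

    public static void main(String[] args) throws Exception {
        List<String> declared = new ArrayList<>();
        for (Field f : Demo.class.getDeclaredFields()) {
            declared.add(f.getName());
        }

        List<String> beanProps = new ArrayList<>();
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(Demo.class, Object.class)
                            .getPropertyDescriptors()) {
            beanProps.add(pd.getName());
        }

        System.out.println("declared fields: " + declared);  // typically [zebra, apple]
        System.out.println("bean properties: " + beanProps); // [apple, zebra] -- alphabetical
    }
}
```

That is why the reflection-based schema above lines up with the CSV (assuming the CSV columns follow the class's declaration order) while encoderObject1.schema() does not.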