How do I join two Spark Datasets into one using Java objects?

Question

I have a small problem joining two Datasets in Spark. This is my setup:

SparkConf conf = new SparkConf()
    .setAppName("MyFunnyApp")
    .setMaster("local[*]");

SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .config("spark.debug.maxToStringFields", 150)
    .getOrCreate();
//...
//Do stuff
//...
Encoder<MyOwnObject1> encoderObject1 = Encoders.bean(MyOwnObject1.class);
Encoder<MyOwnObject2> encoderObject2 = Encoders.bean(MyOwnObject2.class);

Dataset<MyOwnObject1> object1DS = spark.read()
    .option("header","true")
    .option("delimiter",";")
    .option("inferSchema","true")
    .csv(pathToFile1)
    .as(encoderObject1);

Dataset<MyOwnObject2> object2DS = spark.read()
    .option("header","true")
    .option("delimiter",";")
    .option("inferSchema","true")
    .csv(pathToFile2)
    .as(encoderObject2);

I can print the schema and show the data correctly.

// The problem starts here
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS =
    object1DS.join(object2DS, object1DS.col("column01")
    .equalTo(object2DS.col("column01")))
    // Encoders.tuple takes encoder instances, not class names
    .as(Encoders.tuple(encoderObject1, encoderObject2));

The last line cannot perform the join and gives me this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<"LIST WITH ALL VARS FROM TWO OBJECT"> to Tuple2, but failed as the number of fields does not line up.;

That is true: join() returns a single flat row with every column from both objects, and that does not line up with the two fields of a Tuple2...

Then I tried this:

 Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS = object1DS
    .joinWith(object2DS, object1DS
        .col("column01")
        .equalTo(object2DS.col("column01")));

And it works fine! However, I need a new Dataset without the tuple. I have an object3 that holds some fields from object1 and object2, and that is where I run into this problem:

Encoder<MyOwnObject3> encoderObject3 = Encoders.bean(MyOwnObject3.class);
Dataset<MyOwnObject3> object3DS = joinObjectDS.map(tupleObject1Object2 -> {
    MyOwnObject1 myOwnObject1 = tupleObject1Object2._1();
    MyOwnObject2 myOwnObject2 = tupleObject1Object2._2();
    MyOwnObject3 myOwnObject3 = new MyOwnObject3(); //Sets all vars with start values
    //...
    //Sets data from object 1 and 2 to 3.
    //...
    return myOwnObject3;
}, encoderObject3);

It fails! Here is the error:

17/05/10 12:17:43 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 593, Column 72: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import

followed by thousands more error lines...

What can I do? So far I have tried:

Building my objects only from String, int (or Integer) and double (or Double) fields, nothing else
Using different encoders such as kryo or javaSerialization
Using JavaRDD (works, but very slowly) and using DataFrames with Rows (works, but I would need to change many objects)
Making sure all my Java objects are Serializable
Using Spark 2.1.0 and 2.1.1; my pom.xml currently has 2.1.1

I want to use Datasets to get the speed of DataFrames together with the object syntax of JavaRDD...

Help?

Thanks

Answer 1

I finally found a solution.

I had a problem with the inferSchema option when my code was creating a Dataset. I have a String column, but inferSchema returned it as an Integer column because all of its values are numeric, and I need to keep them as Strings (like "0001", "0002"...). So I needed to define the schema explicitly, and since I have many fields, I wrote this for each of my classes:

List<StructField> fieldsObject1 = new ArrayList<>();
for (Field field : MyOwnObject1.class.getDeclaredFields()) {
    fieldsObject1.add(DataTypes.createStructField(
        field.getName(),
        CatalystSqlParser.parseDataType(field.getType().getSimpleName()),
        true)
    );
}
StructType schemaObject1 = DataTypes.createStructType(fieldsObject1);

Dataset<MyOwnObject1> object1DS = spark.read()
    .option("header","true")
    .option("delimiter",";")
    .schema(schemaObject1)
    .csv(pathToFile1)
    .as(encoderObject1);

This works fine.
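
One caveat with the reflection loop above: getDeclaredFields() also returns static fields such as serialVersionUID, and the Java documentation does not guarantee any particular order for the returned array. Below is a minimal sketch in plain Java (no Spark needed) of filtering the fields before building the schema. SampleBean and its field names are invented stand-ins for MyOwnObject1, and SchemaFieldLister is a hypothetical helper, not a Spark class.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for MyOwnObject1; the field names are invented.
class SampleBean implements Serializable {
    private static final long serialVersionUID = 1L; // must not become a column
    private String column01;
    private int quantity;
    private double price;
}

// Hypothetical helper for illustration only.
class SchemaFieldLister {

    // Lists "name:JavaType" for every instance field declared on the class,
    // skipping static and compiler-generated (synthetic) fields.
    static List<String> describeColumns(Class<?> beanClass) {
        List<String> columns = new ArrayList<>();
        for (Field field : beanClass.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            columns.add(field.getName() + ":" + field.getType().getSimpleName());
        }
        return columns;
    }

    public static void main(String[] args) {
        // Prints the three instance fields; serialVersionUID is filtered out.
        System.out.println(describeColumns(SampleBean.class));
    }
}
```

The same filter could be placed inside the loop that calls DataTypes.createStructField. If the file's column order matters, it is probably safer to check the resulting field order against the CSV header than to rely on reflection order.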

The "best" solution would be this:

  Dataset<MyOwnObject1> object1DS = spark.read()
    .option("header","true")
    .option("delimiter",";")
    .schema(encoderObject1.schema())
    .csv(pathToFile1)
    .as(encoderObject1);

However, encoderObject1.schema() returns a schema with the fields in alphabetical order rather than in their original order, so this option fails when I read the CSV. Maybe Encoders should return a schema with the fields in their original order instead of alphabetical order.
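
Why an alphabetically ordered schema breaks the read can be illustrated in plain Java, assuming (as the behavior described above suggests) that an explicit schema is matched to the CSV columns by position rather than by header name. The column names and the ColumnOrderDemo class are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Spark code.
class ColumnOrderDemo {

    // Returns a sorted copy, mimicking a schema whose fields are alphabetical.
    static List<String> alphabetical(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    // Pairs each schema field with the CSV column found at the same position.
    static Map<String, String> pairByPosition(List<String> schemaNames,
                                              List<String> csvHeader) {
        Map<String, String> pairs = new LinkedHashMap<>();
        int n = Math.min(schemaNames.size(), csvHeader.size());
        for (int i = 0; i < n; i++) {
            pairs.put(schemaNames.get(i), csvHeader.get(i));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> csvHeader = Arrays.asList("id", "code", "amount");
        // Schema field -> CSV column it would actually receive.
        System.out.println(pairByPosition(alphabetical(csvHeader), csvHeader));
        // prints {amount=id, code=code, id=amount}
    }
}
```

Only the columns whose alphabetical position happens to match their position in the file end up with the right data. A schema built in file order, like the reflection-based one above, avoids the mismatch.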

Solution 1
-1 Accepted 2017-05-11 15:27:32
