I'm writing a Java application. I have a Spark Dataset<MyObject> that results in a binary-typed column:
Dataset<MyObject> dataset = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
dataset.printSchema();
//root
//|-- value: binary (nullable = true)
MyObject has several (nested) fields, and I want to "explode" them into multiple columns in my Dataset. Some of the new columns also need to be computed from multiple attributes of MyObject.
. As a solution, I could use .withColumn()
and apply a UDF. Unfortunately, I don't know how to accept a binary type in the UDF and then convert it to MyObject
. Any suggestions on how to do that?
Thanks to blackbishop's suggestion I solved it. Here is the complete solution:
You need to register the UDF:
UDFRegistration udfRegistration = sparkSession.sqlContext().udf();
udfRegistration.register("extractSomeLong", extractSomeLong(), DataTypes.LongType);
Then declare and implement the UDF. Its input type must be byte[], and inside it you convert the byte array back to your object:
private static UDF1<byte[], Long> extractSomeLong() {
    return (byteArray) -> {
        if (byteArray == null) {
            return -1L;
        }
        // Deserialize the Java-serialized bytes back into a MyObject
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            MyObject myObject = (MyObject) in.readObject();
            return myObject.getSomeLong();
        }
    };
}
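Since Encoders.javaSerialization stores plain Java serialization bytes, the byte[]-to-object conversion the UDF performs can be verified outside Spark. A minimal round-trip sketch, where the Person class is a hypothetical stand-in for MyObject:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RoundTrip {
    // Hypothetical stand-in for MyObject: any Serializable class works the same way
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long someLong;
        Person(long someLong) { this.someLong = someLong; }
        long getSomeLong() { return someLong; }
    }

    // Produce the same byte[] form that Spark stores in the binary column
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    // The conversion the UDF performs: byte[] back to the object
    static Person deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Person) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new Person(42L));
        System.out.println(deserialize(bytes).getSomeLong()); // prints 42
    }
}
```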
And finally it can be used with (the binary column is named "value", as shown in the schema above):
Dataset<MyObject> data = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
Dataset<Row> processedData = data.withColumn("ID", functions.callUDF("extractSomeLong", functions.col("value")));
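The question also asks for columns computed from multiple attributes; the same UDF pattern works by deserializing once inside the lambda and combining fields. A hedged sketch of that combining logic outside Spark, where the Rect class and its width/height fields are hypothetical stand-ins for MyObject's attributes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CombineFields {
    // Hypothetical stand-in for MyObject with two attributes to combine
    static class Rect implements Serializable {
        private static final long serialVersionUID = 1L;
        final long width, height;
        Rect(long width, long height) { this.width = width; this.height = height; }
    }

    // Body of a UDF1<byte[], Long> that derives one column from two attributes
    static Long extractArea(byte[] byteArray) throws IOException, ClassNotFoundException {
        if (byteArray == null) {
            return -1L; // same null sentinel as extractSomeLong
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            Rect r = (Rect) in.readObject();
            return r.width * r.height; // combine multiple attributes into one value
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Rect(3L, 4L));
        }
        System.out.println(extractArea(bos.toByteArray())); // prints 12
    }
}
```

In Spark, such a body would be registered the same way as extractSomeLong (with DataTypes.LongType as the return type) and applied with one withColumn call per derived column.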