I'm writing a Java application. I have a Spark Dataset<MyObject> that results in a binary-typed column:
Dataset<MyObject> dataset = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
dataset.printSchema();
//root
//|-- value: binary (nullable = true)
MyObject has several (nested) fields, and I want to "explode" them into multiple columns in my Dataset. Some of the new columns also need to be computed from multiple attributes of MyObject.
. As a solution, I could use .withColumn()
and apply a UDF. Unfortunately, I don't know how to accept a binary type in the UDF and then convert it to MyObject
. Any suggestions on how to do that?
Thanks to blackbishop's suggestion I solved it. Here is the complete solution:
You need to register the UDF:
UDFRegistration udfRegistration = sparkSession.sqlContext().udf();
udfRegistration.register("extractSomeLong", extractSomeLong(), DataTypes.LongType);
Then declare and implement the UDF. Its input type must be byte[], and inside it you convert the byte array back to your object:
private static UDF1<byte[], Long> extractSomeLong() {
    return (byteArray) -> {
        if (byteArray == null) {
            return -1L;
        }
        // Deserialize the Java-serialized bytes back into a MyObject
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            MyObject myObject = (MyObject) in.readObject();
            return myObject.getSomeLong();
        }
    };
}
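Since Encoders.javaSerialization stores plain Java serialization bytes, the byte[]-to-object conversion the UDF performs can be verified outside Spark. A minimal round-trip sketch, where the Person class is a hypothetical stand-in for MyObject:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RoundTrip {
    // Hypothetical stand-in for MyObject: any Serializable class works the same way
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long someLong;
        Person(long someLong) { this.someLong = someLong; }
        long getSomeLong() { return someLong; }
    }

    // Produce the same byte[] form that Spark stores in the binary column
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    // The conversion the UDF performs: byte[] back to the object
    static Person deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Person) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new Person(42L));
        System.out.println(deserialize(bytes).getSomeLong()); // prints 42
    }
}
```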
And finally it can be used with (the binary column is named "value", as shown in the schema above):
Dataset<MyObject> data = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
Dataset<Row> processedData = data.withColumn("ID", functions.callUDF("extractSomeLong", functions.col("value")));
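The question also asks for columns computed from multiple attributes; the same UDF pattern works by deserializing once inside the lambda and combining fields. A hedged sketch of that combining logic outside Spark, where the Rect class and its width/height fields are hypothetical stand-ins for MyObject's attributes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CombineFields {
    // Hypothetical stand-in for MyObject with two attributes to combine
    static class Rect implements Serializable {
        private static final long serialVersionUID = 1L;
        final long width, height;
        Rect(long width, long height) { this.width = width; this.height = height; }
    }

    // Body of a UDF1<byte[], Long> that derives one column from two attributes
    static Long extractArea(byte[] byteArray) throws IOException, ClassNotFoundException {
        if (byteArray == null) {
            return -1L; // same null sentinel as extractSomeLong
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            Rect r = (Rect) in.readObject();
            return r.width * r.height; // combine multiple attributes into one value
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Rect(3L, 4L));
        }
        System.out.println(extractArea(bos.toByteArray())); // prints 12
    }
}
```

In Spark, such a body would be registered the same way as extractSomeLong (with DataTypes.LongType as the return type) and applied with one withColumn call per derived column.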