Apply function on a single column of a Dataset in Apache Spark using Java

Question

Say I have a Dataset:

Dataset<Row> sqlDF = this.spark.sql("SELECT first_name, last_name, age FROM persons");

This returns a Dataset with three columns: first_name, last_name, age.

I want to apply a function that adds 5 to the age column and returns a new Dataset with the same columns as the original Dataset but with the age value changed:

public int add_age(int old_age){
     return old_age + 5;
}

How do I go about doing this with Apache Spark on Java?

Answer 1

I solved this by building a StructType that describes the three columns, then mapping each row to a newly constructed row with RowFactory and applying the function to the age column:

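    // Needs imports for Dataset, Row, RowFactory (org.apache.spark.sql),
    // DataTypes, StructType (org.apache.spark.sql.types),
    // RowEncoder (org.apache.spark.sql.catalyst.encoders)
    // and MapFunction (org.apache.spark.api.java.function).

    // Schema of the output rows: the same three columns as the input.
    StructType customStructType = new StructType();
    customStructType = customStructType.add("first_name", DataTypes.StringType, true);
    customStructType = customStructType.add("last_name", DataTypes.StringType, true);
    customStructType = customStructType.add("age", DataTypes.IntegerType, true);

    // The MapFunction cast tells the compiler which Dataset.map overload the lambda targets.
    // getInt(2) reads age as an int for add_age; it assumes age is never null.
    Dataset<Row> changed_data = sqlDF.map(
            (MapFunction<Row, Row>) row ->
                RowFactory.create(row.get(0), row.get(1), add_age(row.getInt(2))),
            RowEncoder.apply(customStructType));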
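
For comparison, here is a minimal sketch of two alternatives that avoid rebuilding every row: replacing the column with withColumn and a built-in column expression, or wrapping the function in a UDF. The class name AddAgeSketch, the helper samplePersons, and the sample data are made up for illustration. The sketch assumes Spark SQL 2.3 or later on the classpath and a local master, and it was not part of the original answer.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AddAgeSketch {

    // Variant 1: built-in column arithmetic. Passing an existing column name
    // to withColumn replaces that column and leaves the others untouched.
    static Dataset<Row> addFiveWithColumn(Dataset<Row> df) {
        return df.withColumn("age", col("age").plus(5));
    }

    // Variant 2: wrap arbitrary Java logic in a UDF. The explicit UDF1 cast
    // picks the Java overload of functions.udf; null ages stay null.
    static Dataset<Row> addFiveWithUdf(Dataset<Row> df) {
        UserDefinedFunction addAge = udf(
                (UDF1<Integer, Integer>) age -> age == null ? null : age + 5,
                DataTypes.IntegerType);
        return df.withColumn("age", addAge.apply(col("age")));
    }

    // Hypothetical stand-in for the persons table from the question.
    static Dataset<Row> samplePersons(SparkSession spark) {
        StructType schema = new StructType()
                .add("first_name", DataTypes.StringType, true)
                .add("last_name", DataTypes.StringType, true)
                .add("age", DataTypes.IntegerType, true);
        List<Row> rows = Arrays.asList(
                RowFactory.create("Ada", "Lovelace", 36),
                RowFactory.create("Alan", "Turing", 41));
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("add-age-sketch")
                .getOrCreate();
        Dataset<Row> persons = samplePersons(spark);
        addFiveWithColumn(persons).show();
        addFiveWithUdf(persons).show();
        spark.stop();
    }
}
```

When the transformation can be written with Column methods, the first variant is usually preferable, since Spark can optimize built-in expressions but treats a UDF as opaque. The UDF variant is the closer match when the logic in add_age is more involved than simple arithmetic.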
