
Map a row in Dataset<Row> to an object class in Spark (Java)

I am working with HBase and Parquet. I have written code that reads values from HBase and maps them to an object class, but I am having trouble replicating this with Parquet using a Dataset.

HBase:

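import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import scala.Tuple3;

JavaPairRDD<ImmutableBytesWritable, Result> data = sc.newAPIHadoopRDD(getHbaseConf(),
            TableInputFormat.class, ImmutableBytesWritable.class, Result.class);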

JavaRDD<List<Tuple3<Long, Integer, Double>>> tempData = data
                .values()
                //Uses HBaseResultToSimple... class to parse the data.
                .map(value -> {
                    SimpleObject object = oParser.call(value);
                    // Get the sample property, remove leading and ending spaces and split it by comma
                    // to get each sample individually
                    List<Tuple2<String, Integer>> samples = zipWithIndex((object.getSamples().trim().split(",")));

                    // Gets the unique identifier for that sp.
                    Long sp = object.getPos();

                    // Calculates the hamming distance for this sp for each sample.
                    // i.e. 0|0 => 0, 0|1 => 1, 1|0 => 1, 1|1 => 2
                    return samples.stream().map(t -> {
                        String alleles = t._1();
                        Integer patient = t._2();

                        List<String> values = Arrays.asList(alleles.split("\\|"));

                        Double firstA = Double.parseDouble(values.get(0));
                        Double second = Double.parseDouble(values.get(1));

                        // Returns the initial sp id, p id and the distance in form of Tuple.
                        return new Tuple3<>(sp, patient, firstA + second);
                    }).collect(Collectors.toList());
                });

I can read the data from Parquet into a Dataset, but I simply can't replicate the approach above.

Dataset<Row> url = session.read().parquet(fileName);

I just need to know how to map the rows in Dataset<Row> to an object class, as I do with .map(value -> {... in the approach above.

Any help would be appreciated.

Option 1: Convert your DataFrame (aka Dataset<Row>) into a typed Dataset. Assuming the class Data is a simple Java bean that matches the structure of your Parquet file, and inputDf is the Dataset<Row> read from that file, you can use:

Dataset<Data> ds = inputDf.as(Encoders.bean(Data.class));

On this dataset, you can use a map function with typed access:

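// Requires: import org.apache.spark.api.java.function.MapFunction;
// The cast avoids an "ambiguous method" compile error on Spark builds for Scala 2.12+.
Dataset<String> ds2 = ds.map((MapFunction<Data, String>) d -> d.getA(), Encoders.STRING());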

(In this example, I assume that the class Data has a property called A of type String.)
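Encoders.bean relies on the JavaBean conventions, and its Javadoc states that the type must be publicly accessible. A minimal sketch of what Data could look like, assuming (hypothetically) a Parquet file with a single string column named a:

```java
import java.io.Serializable;

// Hypothetical bean for a Parquet file with one string column named "a".
// Serializable so instances can also travel inside closures or RDDs.
public class Data implements Serializable {
    private String a;

    // Bean encoders instantiate the class through a public no-argument constructor.
    public Data() { }

    // The getter/setter pair defines the bean property "a" (accessed as getA()).
    public String getA() { return a; }

    public void setA(String a) { this.a = a; }
}
```

Each additional column in the file would get its own field with a matching getter and setter.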

Option 2: If you do not want an extra class, you can use the Row object directly in your map call:

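// Requires: import org.apache.spark.api.java.function.MapFunction;
// The cast avoids an "ambiguous method" compile error on Spark builds for Scala 2.12+.
Dataset<String> ds3 = inputDf.map((MapFunction<Row, String>) r -> r.getString(0), Encoders.STRING());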

(Again, I assume that the first column is a string.)
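With either option, the per-sample arithmetic from the question (0|0 => 0, 0|1 => 1, 1|0 => 1, 1|1 => 2) does not depend on Spark, so it can live in a plain static helper that the map lambda calls. The following is only a sketch: the names AlleleDistance, distance and distances are made up for illustration, and it assumes the samples column uses the same comma-separated a|b format as the HBase code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class AlleleDistance {

    private AlleleDistance() { }

    // Splits one genotype such as "0|1" on '|' and sums both alleles.
    static double distance(String genotype) {
        String[] alleles = genotype.trim().split("\\|");
        if (alleles.length != 2) {
            throw new IllegalArgumentException("Expected a|b but got: " + genotype);
        }
        return Double.parseDouble(alleles[0]) + Double.parseDouble(alleles[1]);
    }

    // Converts a comma-separated samples string into one distance per sample.
    // The list index plays the role of the sample (patient) index.
    static List<Double> distances(String samples) {
        List<Double> result = new ArrayList<>();
        for (String genotype : samples.trim().split(",")) {
            result.add(distance(genotype));
        }
        return result;
    }
}
```

Inside the lambda, the helper would be called on the samples string of each row or bean, for example on the value returned by r.getString(...) for that column. Because the methods are static, the lambda does not need to capture a parser object.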


 