
Map with custom object as key to DataFrame in Apache Spark

I'm having trouble with creating a DataFrame from an RDD.

To start off, I'm using Spark to generate the data I work with (via simulations on the workers), and what I get back are Report objects.

These Report objects consist of two HashMaps whose keys are custom-made and nearly identical between the two maps, and whose values are Integer / Double respectively. Worth noting: I currently need these keys and maps to add and update values efficiently during the simulations, so changing this to a "flat" object could cost a lot of efficiency.
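For context, Report is shaped roughly like this (a sketch only; the real class has more to it, and only getEvents() actually appears in my code further down, the second map's accessor name here is just a placeholder):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of Report: two maps sharing (nearly) the same custom keys.
public class Report implements Serializable {

    // Key -> count of events observed during the simulation
    private final Map<Key, Integer> events = new HashMap<>();
    // Key -> accumulated double value (placeholder name)
    private final Map<Key, Double> totals = new HashMap<>();

    public Map<Key, Integer> getEvents() {
        return events;
    }

    public Map<Key, Double> getTotals() {
        return totals;
    }

    // Updating counts in place is why the map structure is kept during the simulation.
    public void addEvent(Key key, int count) {
        Integer current = events.get(key);
        events.put(key, current == null ? count : current + count);
    }
}

The custom key class that both maps use is: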

public class Key implements Serializable, Comparable<Key> {

    private final States states;
    private final String event;
    private final double age;

    ...
}
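The elided members include compareTo plus, because Key is used as a HashMap key, equals and hashCode, which the fast add/update during the simulation depends on. Roughly along these lines (a sketch, not the exact code; States is assumed to have matching equals/hashCode):

    // Inside Key: value equality over all three fields
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        return Double.compare(age, other.age) == 0
                && event.equals(other.event)
                && states.equals(other.states);
    }

    @Override
    public int hashCode() {
        int result = states.hashCode();
        result = 31 * result + event.hashCode();
        long bits = Double.doubleToLongBits(age);
        return 31 * result + (int) (bits ^ (bits >>> 32));
    }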

And the States class is

public class States implements Serializable, Comparable<States> {

    private String stateOne;
    private String stateTwo;

    ...
}

The states used to be Enums, but as it turns out, DataFrame doesn't like that. (The Strings are still set from Enums to ensure the values are correct.)
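In other words, something along these lines (the enum names here are made up):

// Hypothetical enums; the real names differ.
public enum StateOne { HEALTHY, SICK }
public enum StateTwo { EMPLOYED, UNEMPLOYED }

// Inside States: store the DataFrame-friendly String,
// but derive it from the enum so only valid values can appear.
public States(StateOne one, StateTwo two) {
    this.stateOne = one.name();
    this.stateTwo = two.name();
}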

The problem is that I want to convert these maps to DataFrames so that I can use SQL etc. to manipulate/filter the data.

I am able to create DataFrames by creating a Bean like so

public class Event implements Serializable {

    private String stateOne;
    private String stateTwo;

    private String event;
    private Double age;

    private Integer value;

    ...
}

with getters and setters, but is there a way I can just use Tuple2 (or something similar) to create my DataFrame? That could even give me a nice structure for the database.

I have tried using Tuple2 like this

JavaRDD<Report> reports = dataSet.map(new SimulationFunction(REPLICATIONS_PER_WORKER)).cache();

JavaRDD<Tuple2<Key, Integer>> events = reports.flatMap(new FlatMapFunction<Report, Tuple2<Key, Integer>>() {
    @Override
    public Iterable<Tuple2<Key, Integer>> call(Report t) throws Exception {
        // Turn each Report's Key -> Integer map into a flat list of (Key, value) pairs
        List<Tuple2<Key, Integer>> list = new ArrayList<>(t.getEvents().size());
        for (Entry<Key, Integer> entry : t.getEvents().entrySet()) {
            list.add(new Tuple2<>(entry.getKey(), entry.getValue()));
        }

        return list;
    }
});

DataFrame schemaEvents = sqlContext.createDataFrame(events, ????);

But I don't know what to put where the question marks are.
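As far as I can tell from the Java API, the createDataFrame overloads that take an explicit schema expect a JavaRDD<Row>, not a JavaRDD<Tuple2<...>>, so the closest thing I can sketch is flattening the Key into Row objects by hand with a programmatic StructType. This is only a sketch (the getters on Key and States are assumed from the elided members), and it's not the "just use Tuple2" solution I was hoping for:

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("stateOne", DataTypes.StringType, false),
        DataTypes.createStructField("stateTwo", DataTypes.StringType, false),
        DataTypes.createStructField("event", DataTypes.StringType, false),
        DataTypes.createStructField("age", DataTypes.DoubleType, false),
        DataTypes.createStructField("value", DataTypes.IntegerType, false)
});

JavaRDD<Row> rows = events.map(new Function<Tuple2<Key, Integer>, Row>() {
    @Override
    public Row call(Tuple2<Key, Integer> t) throws Exception {
        Key key = t._1();
        // Flatten the custom Key into plain SQL-friendly columns
        // (getStates(), getEvent(), getAge() etc. are assumed accessors)
        return RowFactory.create(
                key.getStates().getStateOne(),
                key.getStates().getStateTwo(),
                key.getEvent(),
                key.getAge(),
                t._2());
    }
});

DataFrame flattenedEvents = sqlContext.createDataFrame(rows, schema);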

Hopefully I've made myself clear enough and you'll be able to shed some light on this. Thank you in advance!

As zero323 says, it's not possible to do what I'm trying to do. I'll just stick with the beans from now on.
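Concretely, "sticking with the beans" ends up looking something like this sketch (the getters on Key and States are assumed from the elided members, and the temp table name is arbitrary):

JavaRDD<Event> eventBeans = reports.flatMap(new FlatMapFunction<Report, Event>() {
    @Override
    public Iterable<Event> call(Report t) throws Exception {
        List<Event> list = new ArrayList<>(t.getEvents().size());
        for (Entry<Key, Integer> entry : t.getEvents().entrySet()) {
            Key key = entry.getKey();
            Event e = new Event();
            // Flatten the custom Key into plain bean properties
            // (accessor names on Key/States are assumed here)
            e.setStateOne(key.getStates().getStateOne());
            e.setStateTwo(key.getStates().getStateTwo());
            e.setEvent(key.getEvent());
            e.setAge(key.getAge());
            e.setValue(entry.getValue());
            list.add(e);
        }
        return list;
    }
});

// The reflection-based overload infers the schema from the bean's getters/setters.
DataFrame schemaEvents = sqlContext.createDataFrame(eventBeans, Event.class);

// ...which is enough for the SQL-style filtering I was after:
schemaEvents.registerTempTable("events");
DataFrame byState = sqlContext.sql(
        "SELECT stateOne, stateTwo, SUM(value) AS total FROM events GROUP BY stateOne, stateTwo");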
