
Parse Json Object with an Array and Map to Multiple Pairs with Apache Spark in Java

I've googled it all day long and couldn't find a straight answer, so I ended up posting a question here.

I have a file containing line-delimited JSON objects:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "103b", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "103b", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

My goal is to parse this file with Apache Spark in Java. I referenced How to Parse CSV or JSON File with Apache Spark, and so far I can successfully parse each line of JSON into a JavaRDD<JsonObject> using Gson:

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("fileName");
JavaRDD<JsonObject> records = data.map(new Function<String, JsonObject>() {
    public JsonObject call(String line) throws Exception {
        Gson gson = new Gson();
        JsonObject json = gson.fromJson(line, JsonObject.class);
        return json;
    }
});
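As an aside, a new Gson per line is avoidable: a minimal sketch that reuses one instance per partition via mapPartitions, assuming the Spark 1.x Java API (where call() returns an Iterable; Spark 2.x changed it to an Iterator):

// needs: java.util.Iterator, org.apache.spark.api.java.function.FlatMapFunction
JavaRDD<JsonObject> records = data.mapPartitions(
        new FlatMapFunction<Iterator<String>, JsonObject>() {
    public Iterable<JsonObject> call(Iterator<String> lines) throws Exception {
        Gson gson = new Gson();  // one instance per partition, not per line
        List<JsonObject> parsed = new LinkedList<JsonObject>();
        while (lines.hasNext()) {
            parsed.add(gson.fromJson(lines.next(), JsonObject.class));
        }
        return parsed;
    }
});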

Where I'm really stuck is that I want to deserialize the "rooms" array so that it fits my class Event:

public class Event implements Serializable {
    public static final long serialVersionUID = 42L;
    private String deviceId;
    private int timestamp;
    private String room;
    // constructor , getters and setters 
}

In other words, from this line:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}

I want to create two Event objects in Spark:

obj1: deviceId = "103b", timestamp = 1436941050, room = "Office"
obj2: deviceId = "103b", timestamp = 1436941050, room = "Foyer"

I did a little searching and tried flatMapValue, but no luck... it threw an error:

JavaRDD<Event> events = records.flatMapValue(new Function<JsonObject, Iterable<Event>>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        JsonArray rooms = json.get("rooms").getAsJsonArray();
        List<Event> data = new LinkedList<Event>();
        for (JsonElement room : rooms) {
            data.add(new Event(json.get("device_id").getAsString(), json.get("timestamp").getAsInt(), room.toString()));
        }
        return data;
    }
});

I'm very new to Spark and MapReduce. I would be grateful if you could help me out. Thanks in advance!
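The error is expected: JavaRDD has no flatMapValue method (flatMapValues exists, but only on JavaPairRDD). For a plain JavaRDD<JsonObject>, flatMap does what you want. A minimal sketch, assuming the Spark 1.x Java API (where FlatMapFunction.call returns an Iterable; in Spark 2.x it returns an Iterator):

// needs: org.apache.spark.api.java.function.FlatMapFunction
JavaRDD<Event> events = records.flatMap(new FlatMapFunction<JsonObject, Event>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        String deviceId = json.get("device_id").getAsString();
        int timestamp = json.get("timestamp").getAsInt();
        List<Event> out = new LinkedList<Event>();
        for (JsonElement room : json.get("rooms").getAsJsonArray()) {
            // getAsString() avoids the surrounding quotes that toString() keeps
            out.add(new Event(deviceId, timestamp, room.getAsString()));
        }
        return out;
    }
});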

Alternatively, if you load the JSON data into a DataFrame:

DataFrame df = sqlContext.read().json("/path/to/json");

You can do this easily with explode; assigning the select and calling show() displays the result:

DataFrame exploded = df.select(
    df.col("device_id"),
    df.col("timestamp"),
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room"));
exploded.show();

For input:

{"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

You will get:

+---------+------+----------+
|device_id|  room| timestamp|
+---------+------+----------+
|        1|Office|1436941050|
|        1| Foyer|1436941050|
|        2|Office|1435677490|
|        2|   Lab|1435677490|
|        3|Office|1436673850|
|        3| Foyer|1436673850|
+---------+------+----------+
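If you still want Event objects rather than Rows, a hypothetical continuation of this answer (not in the original) maps the exploded DataFrame back to the asker's class, assuming Spark 1.4+ so that Row.getAs(String) is available. Note that Spark's JSON schema inference reads integer fields as LongType, hence the cast:

// needs: org.apache.spark.sql.Row, org.apache.spark.api.java.function.Function
JavaRDD<Event> events = exploded.javaRDD().map(new Function<Row, Event>() {
    public Event call(Row row) throws Exception {
        long ts = row.<Long>getAs("timestamp");  // inferred as LongType, not int
        return new Event(row.<String>getAs("device_id"), (int) ts,
                         row.<String>getAs("room"));
    }
});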
Another answer, in Scala:

val formatrecord = records.map(fromJson[mapClass](_))

Here mapClass should be a case class used to map the objects in the record JSON.
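The Java/Gson analogue of that approach is to map each line straight to a typed object instead of a generic JsonObject; a sketch with a hypothetical DeviceRecord POJO whose field names mirror the JSON keys (Gson binds same-named fields by default):

// Hypothetical POJO; field names intentionally match the JSON keys.
public class DeviceRecord implements Serializable {
    public String device_id;
    public int timestamp;
    public List<String> rooms;
}

JavaRDD<DeviceRecord> typed = data.map(new Function<String, DeviceRecord>() {
    public DeviceRecord call(String line) throws Exception {
        return new Gson().fromJson(line, DeviceRecord.class);
    }
});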
