
Parse Json Object with an Array and Map to Multiple Pairs with Apache Spark in Java

I've googled it all day long and couldn't find a straight answer, so I ended up posting a question here.

I have a file containing line-delimited JSON objects:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "103b", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "103b", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

My goal is to parse this file with Apache Spark in Java. I referenced How to Parse CSV or JSON File with Apache Spark, and so far I can successfully parse each line of JSON into a JavaRDD<JsonObject> using Gson:

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("fileName");
JavaRDD<JsonObject> records = data.map(new Function<String, JsonObject>() {
    public JsonObject call(String line) throws Exception {
        Gson gson = new Gson();
        JsonObject json = gson.fromJson(line, JsonObject.class);
        return json;
    }
});
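As an aside, a new Gson per line is avoidable: a minimal sketch that reuses one instance per partition via mapPartitions, assuming the Spark 1.x Java API (where call() returns an Iterable; Spark 2.x changed it to an Iterator):

// needs: java.util.Iterator, org.apache.spark.api.java.function.FlatMapFunction
JavaRDD<JsonObject> records = data.mapPartitions(
        new FlatMapFunction<Iterator<String>, JsonObject>() {
    public Iterable<JsonObject> call(Iterator<String> lines) throws Exception {
        Gson gson = new Gson();  // one instance per partition, not per line
        List<JsonObject> parsed = new LinkedList<JsonObject>();
        while (lines.hasNext()) {
            parsed.add(gson.fromJson(lines.next(), JsonObject.class));
        }
        return parsed;
    }
});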

Where I'm really stuck is that I want to deserialize the "rooms" array so that it fits my class Event:

public class Event implements Serializable {
    public static final long serialVersionUID = 42L;
    private String deviceId;
    private int timestamp;
    private String room;
    // constructor , getters and setters 
}

In other words, from this line:

{"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}

I want to create two Event objects in Spark:

obj1: deviceId = "103b", timestamp = 1436941050, room = "Office"
obj2: deviceId = "103b", timestamp = 1436941050, room = "Foyer"

I did a little searching and tried flatMapValue, but no luck... it threw an error:

JavaRDD<Event> events = records.flatMapValue(new Function<JsonObject, Iterable<Event>>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        JsonArray rooms = json.get("rooms").getAsJsonArray();
        List<Event> data = new LinkedList<Event>();
        for (JsonElement room : rooms) {
            data.add(new Event(json.get("device_id").getAsString(), json.get("timestamp").getAsInt(), room.toString()));
        }
        return data;
    }
});

I'm very new to Spark and MapReduce. I would be grateful if you could help me out. Thanks in advance!
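The error is expected: JavaRDD has no flatMapValue method (flatMapValues exists, but only on JavaPairRDD). For a plain JavaRDD<JsonObject>, flatMap does what you want. A minimal sketch, assuming the Spark 1.x Java API (where FlatMapFunction.call returns an Iterable; in Spark 2.x it returns an Iterator):

// needs: org.apache.spark.api.java.function.FlatMapFunction
JavaRDD<Event> events = records.flatMap(new FlatMapFunction<JsonObject, Event>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        String deviceId = json.get("device_id").getAsString();
        int timestamp = json.get("timestamp").getAsInt();
        List<Event> out = new LinkedList<Event>();
        for (JsonElement room : json.get("rooms").getAsJsonArray()) {
            // getAsString() avoids the surrounding quotes that toString() keeps
            out.add(new Event(deviceId, timestamp, room.getAsString()));
        }
        return out;
    }
});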

Alternatively, if you load the JSON data into a DataFrame:

DataFrame df = sqlContext.read().json("/path/to/json");

You can do this easily with explode; assigning the select and calling show() displays the result:

DataFrame exploded = df.select(
    df.col("device_id"),
    df.col("timestamp"),
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room"));
exploded.show();

For input:

{"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

You will get:

+---------+------+----------+
|device_id|  room| timestamp|
+---------+------+----------+
|        1|Office|1436941050|
|        1| Foyer|1436941050|
|        2|Office|1435677490|
|        2|   Lab|1435677490|
|        3|Office|1436673850|
|        3| Foyer|1436673850|
+---------+------+----------+
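If you still want Event objects rather than Rows, a hypothetical continuation of this answer (not in the original) maps the exploded DataFrame back to the asker's class, assuming Spark 1.4+ so that Row.getAs(String) is available. Note that Spark's JSON schema inference reads integer fields as LongType, hence the cast:

// needs: org.apache.spark.sql.Row, org.apache.spark.api.java.function.Function
JavaRDD<Event> events = exploded.javaRDD().map(new Function<Row, Event>() {
    public Event call(Row row) throws Exception {
        long ts = row.<Long>getAs("timestamp");  // inferred as LongType, not int
        return new Event(row.<String>getAs("device_id"), (int) ts,
                         row.<String>getAs("room"));
    }
});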
Another answer, in Scala:

val formatrecord = records.map(fromJson[mapClass](_))

Here mapClass should be a case class used to map the objects in the record JSON.
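The Java/Gson analogue of that approach is to map each line straight to a typed object instead of a generic JsonObject; a sketch with a hypothetical DeviceRecord POJO whose field names mirror the JSON keys (Gson binds same-named fields by default):

// Hypothetical POJO; field names intentionally match the JSON keys.
public class DeviceRecord implements Serializable {
    public String device_id;
    public int timestamp;
    public List<String> rooms;
}

JavaRDD<DeviceRecord> typed = data.map(new Function<String, DeviceRecord>() {
    public DeviceRecord call(String line) throws Exception {
        return new Gson().fromJson(line, DeviceRecord.class);
    }
});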
