Hadoop: reading JSON from HDFS

Question

I am trying to read a JSON file into my Hadoop MapReduce job. How can I do this? I have put the file "testinput.json" into /input in HDFS.

When running the job I execute hadoop jar popularityMR2.jar populariy input output, where input is the input directory in HDFS.

public static class PopularityMapper extends Mapper<Object, Text, Text, Text>{


    protected void map(Object key, Text value,
                       Context context)
            throws IOException, InterruptedException {

        JSONParser jsonParser = new JSONParser();
        try {
            JSONObject jsonobject = (JSONObject) jsonParser.parse(new FileReader("hdfs://input/testinput.json"));
            JSONArray jsonArray = (JSONArray) jsonobject.get("votes");

            Iterator<JSONObject> iterator = jsonArray.iterator();
            while(iterator.hasNext()) {
                JSONObject obj = iterator.next();
                String song_id_rave_id = (String) obj.get("song_ID") + "," + (String) obj.get("rave_ID")+ ",";
                String preference = (String) obj.get("preference");
                System.out.println(song_id_rave_id + "||" + preference);
                context.write(new Text(song_id_rave_id), new Text(preference));
            }
        }catch(ParseException e) {
            e.printStackTrace();
        }
    }

}

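This is what my mapper function looks like at the moment. I read the file from HDFS, but it always fails with a file-not-found error.

Does anyone know how I can read this JSON into a JSONObject?

Thanks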

Answer 1

FileReader cannot read from HDFS, only from the local file system.
The file path comes from the job arguments: FileInputFormat.addInputPath(job, new Path(args[0]));
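
As a side note, the string in the question would also be a problem for an HDFS-aware API, because of how it parses as a URI. Here is a minimal sketch using only java.net.URI (no Hadoop classes). The class name HdfsUriCheck and the namenode:8020 address are hypothetical placeholders, not anything from the question's cluster.

```java
import java.net.URI;

class HdfsUriCheck {
    /** Returns "host|path" for the given URI string. */
    static String describe(String uri) {
        URI u = URI.create(uri);
        return u.getHost() + "|" + u.getPath();
    }

    public static void main(String[] args) {
        // The question's string: "input" lands in the host slot, not in the path.
        System.out.println(describe("hdfs://input/testinput.json"));
        // Hypothetical NameNode address, with /input as part of the path.
        System.out.println(describe("hdfs://namenode:8020/input/testinput.json"));
    }
}
```

With two slashes after the scheme, the first segment is read as the authority, so "input" would be taken for a host name rather than a directory. This is one more reason to let the job pass the input path in rather than hard-coding it.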

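In any case, you would not read the file yourself inside the Mapper class.

By default, MapReduce reads line-delimited files, so your JSON objects must be one per line, for example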

{"votes":[]}
{"votes":[]}
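
To see why the one-object-per-line layout matters, here is a rough simulation in plain Java of what a line-oriented input format hands to the mapper: one record per line, keyed by the byte offset of that line. This is only an illustration under that assumption, not Hadoop's actual implementation, and LineRecordSketch is a made-up name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineRecordSketch {
    /** Splits text into "offset<TAB>line" records, one per input line. */
    static List<String> records(String fileContents) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            out.add(offset + "\t" + line);
            // Advance by the line's byte length plus one for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // One JSON object per line: every record is a complete document.
        System.out.println(records("{\"votes\":[]}\n{\"votes\":[]}\n"));
        // Pretty-printed JSON: every record is only a fragment.
        System.out.println(records("{\n  \"votes\": []\n}\n"));
    }
}
```

In the second case each map() call would receive a fragment such as a lone "{", which a JSON parser cannot handle on its own.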

In the mapper, you can then parse the Text value into a JSONObject like this

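 // The enclosing class must extend Mapper<LongWritable, Text, Text, Text>
 // for this signature to override map().
 protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

    JSONParser jsonParser = new JSONParser();
    try {
        // value holds one line of the input, i.e. one complete JSON object
        JSONObject jsonobject = (JSONObject) jsonParser.parse(value.toString());
        JSONArray jsonArray = (JSONArray) jsonobject.get("votes");
        // ... iterate over jsonArray and call context.write(...) as in the question
    } catch (ParseException e) {
        e.printStackTrace();
    }
 }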

If the file contains only a single JSON object, you probably should not be using MapReduce.

Otherwise, you have to implement a WholeFileInputFormat and set it on the Job

job.setInputFormatClass(WholeFileInputFormat.class);

Answer 2

Try the following function, which uses the pydoop library to read JSON from an HDFS path. It works as expected. Hope it helps.

import pydoop.hdfs as hdfs

def lreadline(inputJsonIterator):
    with hdfs.open(inputJsonIterator,mode='rt') as f:
        lines = f.read().split('\n')
    return lines

Answer 1: score 1, accepted, 2019-10-25 18:59:09

Answer 2: score -1, 2020-01-03 12:42:18