
Hadoop read JSON from HDFS

I'm trying to read a JSON file into my Hadoop MapReduce algorithm. How can I do this? I've put a file 'testinput.json' into /input in HDFS.

When calling the MapReduce job I execute hadoop jar popularityMR2.jar populariy input output, with input being the input directory in HDFS.

public static class PopularityMapper extends Mapper<Object, Text, Text, Text>{


    protected void map(Object key, Text value,
                       Context context)
            throws IOException, InterruptedException {

        JSONParser jsonParser = new JSONParser();
        try {
            JSONObject jsonobject = (JSONObject) jsonParser.parse(new FileReader("hdfs://input/testinput.json"));
            JSONArray jsonArray = (JSONArray) jsonobject.get("votes");

            Iterator<JSONObject> iterator = jsonArray.iterator();
            while(iterator.hasNext()) {
                JSONObject obj = iterator.next();
                String song_id_rave_id = (String) obj.get("song_ID") + "," + (String) obj.get("rave_ID")+ ",";
                String preference = (String) obj.get("preference");
                System.out.println(song_id_rave_id + "||" + preference);
                context.write(new Text(song_id_rave_id), new Text(preference));
            }
        }catch(ParseException e) {
            e.printStackTrace();
        }
    }

}

My mapper function now looks like this. I read the file from HDFS, but it always returns a "file not found" error.

Does someone know how I can read this JSON into a JSONObject?

Thanks

  1. FileReader cannot read from HDFS, only the local filesystem.

  2. The filepath comes from the Job parameters - FileInputFormat.addInputPath(job, new Path(args[0])); (see the driver sketch below).

You wouldn't read the file in the Mapper class, anyway.
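For point 2, here is a minimal driver sketch showing how the input directory passed on the command line reaches the job, so the mapper never opens the file itself. The class name PopularityDriver and the job name are assumptions, not taken from your post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class name and job name are assumptions for illustration
public class PopularityDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "popularity");
        job.setJarByClass(PopularityDriver.class);
        job.setMapperClass(PopularityMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. "input" on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "output" on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this, running hadoop jar ... input output hands each line of every file under input to the mapper as a Text value.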

MapReduce reads line-delimited files by default, so your JSON objects would have to be one per line, such as

{"votes":[]}
{"votes":[]}

From the mapper, you would parse the Text objects into a JSONObject like so:

 protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

    JSONParser jsonParser = new JSONParser();
    try {
        // each value is one line of the input, i.e. one complete JSON document
        JSONObject jsonobject = (JSONObject) jsonParser.parse(value.toString());
        JSONArray jsonArray = (JSONArray) jsonobject.get("votes");
        // ... iterate over jsonArray and context.write() as in your original code
    } catch (ParseException e) {
        e.printStackTrace();
    }
 }

If you only have one JSON object in the file, then you probably shouldn't be using MapReduce.


Otherwise, you would have to implement a WholeFileInputFormat and set that in the Job

job.setInputFormatClass(WholeFileInputFormat.class);
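Hadoop does not ship a WholeFileInputFormat, so here is a minimal sketch of one; the class and RecordReader names are assumptions. It marks files as non-splittable and hands each whole file to the mapper as a single BytesWritable value:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch only: class and reader names are assumptions, not a Hadoop built-in
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one map() call per file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // read the entire file from HDFS into one value
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

The mapper signature would then become Mapper<NullWritable, BytesWritable, Text, Text>, and you would parse new String(value.copyBytes()) instead of value.toString().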

I tried reading the JSON from an HDFS path with the following function, using the pydoop library, and it works as expected. Hope it helps.

import pydoop.hdfs as hdfs

def lreadline(input_json_path):
    # open the file on HDFS in text mode and return its lines
    with hdfs.open(input_json_path, mode='rt') as f:
        lines = f.read().split('\n')
    return lines
