I'm trying to read a JSON file into my Hadoop MapReduce algorithm. How can I do this? I've put a file 'testinput.json' into /input in HDFS.
When calling the MapReduce job I execute hadoop jar popularityMR2.jar popularity input output
, with input being the input directory in HDFS.
public static class PopularityMapper extends Mapper&lt;Object, Text, Text, Text&gt; {

    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        JSONParser jsonParser = new JSONParser();
        try {
            JSONObject jsonobject = (JSONObject) jsonParser.parse(new FileReader("hdfs://input/testinput.json"));
            JSONArray jsonArray = (JSONArray) jsonobject.get("votes");

            Iterator&lt;JSONObject&gt; iterator = jsonArray.iterator();
            while (iterator.hasNext()) {
                JSONObject obj = iterator.next();
                String song_id_rave_id = (String) obj.get("song_ID") + "," + (String) obj.get("rave_ID") + ",";
                String preference = (String) obj.get("preference");
                System.out.println(song_id_rave_id + "||" + preference);
                context.write(new Text(song_id_rave_id), new Text(preference));
            }
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
My mapper function now looks like this. I try to read the file from HDFS, but it always returns a file-not-found error.
Does anyone know how I can read this JSON into a JSONObject?
Thanks
FileReader cannot read from HDFS, only from the local filesystem.
The file path comes from the job parameters: FileInputFormat.addInputPath(job, new Path(args[0]))
You wouldn't read the file in the Mapper class, anyway.
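To make the role of the job parameters concrete, a minimal driver might look like the sketch below. The Popularity driver class and job name are assumptions for illustration; PopularityMapper is the class from the question. This is an untested outline, not a definitive implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Popularity {  // hypothetical driver class name
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "popularity");
        job.setJarByClass(Popularity.class);
        job.setMapperClass(PopularityMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // args[0] is the HDFS input directory, args[1] the output directory;
        // the framework hands each line of the input files to the mapper.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this driver, hadoop jar popularityMR2.jar popularity input output passes input to addInputPath, so the mapper never needs to open the file itself.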
MapReduce reads line-delimited files by default, so your JSON objects would have to be one per line, such as
{"votes":[]}
{"votes":[]}
From the mapper, you would parse each Text value into a JSONObject like so:
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    JSONParser jsonParser = new JSONParser();
    try {
        JSONObject jsonobject = (JSONObject) jsonParser.parse(value.toString());
        JSONArray jsonArray = (JSONArray) jsonobject.get("votes");
        // ... iterate over jsonArray and context.write() as before ...
    } catch (ParseException e) {
        e.printStackTrace();
    }
}
If you only have one JSON object in the file, then you probably shouldn't be using MapReduce.
Otherwise, you would have to implement a WholeFileInputFormat and set it on the Job:
job.setInputFormatClass(WholeFileInputFormat.class);
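A WholeFileInputFormat is not part of Hadoop itself; the sketch below follows the well-known pattern of subclassing FileInputFormat, marking files non-splittable, and returning a RecordReader that emits the entire file as one record. Treat it as an outline under those assumptions, not tested code.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // one mapper receives the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<NullWritable, BytesWritable>() {
            private FileSplit fileSplit;
            private TaskAttemptContext ctx;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                this.fileSplit = (FileSplit) s;
                this.ctx = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) return false;
                // Read the entire file into a single value
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                try (FSDataInputStream in =
                        file.getFileSystem(ctx.getConfiguration()).open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}
```

The mapper would then receive the whole JSON document in its value and could parse it in one call, at the cost of losing parallelism within a file.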
I tried reading the JSON from an HDFS path with the following function, using the pydoop library, and it works as expected. Hope it helps.
import pydoop.hdfs as hdfs

def lreadline(inputJsonIterator):
    # Open the HDFS path in text mode and return the file's lines
    with hdfs.open(inputJsonIterator, mode='rt') as f:
        lines = f.read().split('\n')
    return lines