
How are files named in Hadoop when using MultipleOutputs?

I'm using MultipleOutputs to write three outputs, i.e. name, attrib, and others, and I'm using 6 reducers. I get these files in my output directory:

attrib-r-00003  name-r-00004   part-r-00000  part-r-00002  part-r-00004  _SUCCESS
_logs           other-r-00001  part-r-00001  part-r-00003  part-r-00005

My question is, how are these files named? (As in, why is -r-00003 appended to the attrib file; is it that task 00003 produced this file?) I'm currently running Hadoop in pseudo-distributed mode; on a real cluster, would there be a need to combine files (i.e. would attrib have different files from different reducers)? Also, is there a way that I can remove -r-xxxxx from my output file names?

PS my knowledge of Hadoop is pretty limited.

MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program, and nnnnn is an integer designating the part number, starting from zero. The part number ensures that outputs written from different partitions (mappers or reducers) do not collide in the case of the same name.
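The naming scheme described above can be illustrated outside Hadoop with plain Java. This is a hypothetical helper, not part of the MultipleOutputs API; it only mimics the `name-m-nnnnn` / `name-r-nnnnn` pattern:

```java
// Sketch of the MultipleOutputs-style part-file naming convention:
// <baseName>-<phase>-<nnnnn>, where phase is 'm' for map output or
// 'r' for reduce output, and nnnnn is the zero-padded task number.
public class PartFileName {

    // baseName is the arbitrary name the program passes to MultipleOutputs;
    // taskId is the mapper/reducer task (partition) number, starting at 0.
    public static String partFileName(String baseName, char phase, int taskId) {
        return String.format("%s-%c-%05d", baseName, phase, taskId);
    }

    public static void main(String[] args) {
        // Reducer task 3 writing the "attrib" output:
        System.out.println(partFileName("attrib", 'r', 3)); // attrib-r-00003
        // Reducer task 4 writing the "name" output:
        System.out.println(partFileName("name", 'r', 4));   // name-r-00004
    }
}
```

So in your listing, `attrib-r-00003` simply means reducer task 3 happened to receive the keys that were written under the `attrib` name; a different run could produce `attrib` files from several reducers at once.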

Yes, you have to combine the files (i.e. attrib will have different files from different reducers) if you want a single file as output. You can combine the files after the job has completed. For appending, you can look into FileSystem's append method: public FSDataOutputStream append(Path f) throws IOException.
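As a minimal sketch of the merge step, here is the same idea on the local filesystem with plain java.nio (the class and method names are made up for illustration). On a real cluster you would typically use the HDFS FileSystem API or the `hadoop fs -getmerge <hdfs-dir> <local-file>` shell command instead:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: concatenate per-reducer part files (e.g. attrib-r-00000,
// attrib-r-00001, ...) into a single output file, in task order.
public class MergeParts {

    // Merges every file in 'dir' whose name starts with 'prefix'
    // into 'merged'. Sorting the names lexicographically gives task
    // order because the task numbers are zero-padded.
    public static void mergeParts(Path dir, String prefix, Path merged)
            throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(dir)) {
            parts = listing
                    .filter(p -> p.getFileName().toString().startsWith(prefix))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with two fake part files in a temp directory.
        Path dir = Files.createTempDirectory("job-output");
        Files.write(dir.resolve("attrib-r-00000"), "alpha\n".getBytes());
        Files.write(dir.resolve("attrib-r-00001"), "beta\n".getBytes());
        Path merged = dir.resolve("attrib-merged");
        mergeParts(dir, "attrib-r-", merged);
        System.out.print(new String(Files.readAllBytes(merged)));
    }
}
```

Note that HDFS append support depends on your Hadoop version and configuration, so the post-job merge (getmerge or a copy-and-concatenate like the above) is usually the simpler route.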

