
Hadoop MultipleInputs, TextInputFormat with different separators

What is the easiest way to run multiple different mapper classes (using MultipleInputs), all using the same input format but with different input separators?

MultipleInputs allows you to add multiple mappers, each with its own InputFormat:

MultipleInputs.addInputPath(Job job, Path path,
  Class<? extends InputFormat> inputFormatClass,
  Class<? extends Mapper> mapperClass)

The record separator used by TextInputFormat is configured by setting the textinputformat.record.delimiter key in the job configuration. Handy!
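For example (a minimal sketch; the "\n\n" value and job name are just illustrations), the delimiter can be set on the configuration before the job is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Split records on blank lines instead of single newlines (example value).
conf.set("textinputformat.record.delimiter", "\n\n");

Job job = Job.getInstance(conf, "custom-delimiter-job");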

However, that means all mappers added using MultipleInputs.addInputPath(?, TextInputFormat.class, ?) have to share the same separator, as specified by the textinputformat.record.delimiter config key.

Is there any way to get around this, without having to compile separate inputformat classes (extending TextInputFormat ) with a hardcoded separator for each mapper?

If you need multiple input formats, you can create your own InputFormat classes just to change the delimiter and use them for your input paths. You can look here.

Your InputFormat class will override createRecordReader() to use your required delimiter.
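For example (a minimal sketch, not from the original answer; the class name and the ';' delimiter below are made up for illustration), a subclass of TextInputFormat can hardcode its own delimiter by overriding createRecordReader() and passing the delimiter bytes to LineRecordReader:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: a TextInputFormat that always splits records on ';',
// ignoring the shared textinputformat.record.delimiter setting.
public class SemicolonTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // LineRecordReader accepts a custom record delimiter as a byte array.
        return new LineRecordReader(";".getBytes(StandardCharsets.UTF_8));
    }
}

Each such class can then be paired with its own mapper via MultipleInputs (the paths and mapper classes here are placeholders):

MultipleInputs.addInputPath(job, new Path("/input/semicolon"),
    SemicolonTextInputFormat.class, SemicolonMapper.class);
MultipleInputs.addInputPath(job, new Path("/input/newline"),
    TextInputFormat.class, NewlineMapper.class);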
