
Hadoop MultipleInputs, TextInputFormat with different separators

What is the easiest way to run multiple different mapper classes (using MultipleInputs), all using the same input format but with different input separators?

MultipleInputs allows you to add multiple mappers, each with its own InputFormat:

MultipleInputs.addInputPath(Job job, Path path,
  Class<? extends InputFormat> inputFormatClass,
  Class<? extends Mapper> mapperClass)

The record separator used by TextInputFormat is configured by setting the textinputformat.record.delimiter key in the job configuration. Handy!
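For example (a minimal sketch; the "\n\n" value and job name are just illustrations), the delimiter can be set on the configuration before the job is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Split records on blank lines instead of single newlines (example value).
conf.set("textinputformat.record.delimiter", "\n\n");

Job job = Job.getInstance(conf, "custom-delimiter-job");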

However, that means all mappers added using MultipleInputs.addInputPath(?, TextInputFormat.class, ?) have to share the same separator, as specified by the textinputformat.record.delimiter config key.

Is there any way to get around this, without having to compile separate inputformat classes (extending TextInputFormat ) with a hardcoded separator for each mapper?

If you need multiple input formats, you can create your own InputFormat classes just to change the delimiter and use them for your input paths. You can look here.

Your InputFormat class will override createRecordReader() to use your required delimiter.
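For example (a minimal sketch, not from the original answer; the class name and the ';' delimiter below are made up for illustration), a subclass of TextInputFormat can hardcode its own delimiter by overriding createRecordReader() and passing the delimiter bytes to LineRecordReader:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: a TextInputFormat that always splits records on ';',
// ignoring the shared textinputformat.record.delimiter setting.
public class SemicolonTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // LineRecordReader accepts a custom record delimiter as a byte array.
        return new LineRecordReader(";".getBytes(StandardCharsets.UTF_8));
    }
}

Each such class can then be paired with its own mapper via MultipleInputs (the paths and mapper classes here are placeholders):

MultipleInputs.addInputPath(job, new Path("/input/semicolon"),
    SemicolonTextInputFormat.class, SemicolonMapper.class);
MultipleInputs.addInputPath(job, new Path("/input/newline"),
    TextInputFormat.class, NewlineMapper.class);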
