
Hadoop MapReduce: handling a text file with a header

I'm experimenting with Hadoop MapReduce in order to learn it.

I'm trying to map data from a VCF file (http://en.wikipedia.org/wiki/Variant_Call_Format): a VCF is a tab-delimited file that starts with a (possibly large) header. This header is required to interpret the records in the body.
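To make the layout concrete, here is a minimal sketch in plain Java (no Hadoop) that separates the header from the body and pulls the column names out of the "#CHROM" line. The class name VcfLayout, its helper methods and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VcfLayout {
    /** Returns the leading '#' lines, up to and including the "#CHROM" line. */
    public static List<String> headerOf(List<String> lines) {
        List<String> header = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("#")) break;      // first body record: header is over
            header.add(line);
            if (line.startsWith("#CHROM")) break;  // column line closes the header
        }
        return header;
    }

    /** Column names taken from the last header line, without the leading '#'. */
    public static List<String> columnsOf(List<String> header) {
        String last = header.get(header.size() - 1);
        return Arrays.asList(last.substring(1).split("\t"));
    }

    public static void main(String[] args) {
        // Illustrative sample, not taken from a real file.
        List<String> lines = Arrays.asList(
            "##fileformat=VCFv4.1",
            "##source=example",
            "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
            "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14");
        List<String> header = headerOf(lines);
        System.out.println(header.size() + " header lines, columns: " + columnsOf(header));
    }
}
```

Without the column line, a body record is just a row of tab-separated strings, which is why the Mapper needs the header.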

http://wiki.bits.vib.be/index.php/NGS_Exercise.5

I'd like to create a Mapper that uses this data. The Mapper needs access to the header in order to decode the lines.

Based on http://jayunit100.blogspot.fr/2013/07/hadoop-processing-headers-in-mappers.html, I've created this InputFormat with a custom RecordReader:

  public static class VcfInputFormat extends FileInputFormat<LongWritable, Text>
    {
    /* the VCF header is stored here */
    private List<String> headerLines=new ArrayList<String>();

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new VcfRecordReader();
        }  
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
        }

     private class VcfRecordReader extends LineRecordReader
        {
        /* reads all lines starting with '#' */
         @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            super.initialize(genericSplit, context);
            /* fill the outer field; declaring a local headerLines here would shadow it */
            headerLines.clear();
            while( super.nextKeyValue())
                {
                String row = super.getCurrentValue().toString();
                if(!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if(row.startsWith("#CHROM")) break;
                }
            }
        }
    }

Now, in the Mapper, is there a way to get a reference to VcfInputFormat.this.headerLines in order to decode the lines?

  public static class VcfMapper
       extends Mapper<LongWritable, Text, Text, IntWritable>{

    public void map(LongWritable key, Text value, Context context ) throws IOException, InterruptedException {
      my.VcfCodec codec=new my.VcfCodec(???????.headerLines);
      my.Variant variant =codec.decode(value.toString());
      //(....)
    }
  }

I think your case is different from the example you linked to. In that example, the header is used inside the custom RecordReader class to build a single "current value": a line composed of all the filtered words, which is then passed to the mapper. In your case, however, you want to use the header information outside the RecordReader, i.e. in your mapper, and that cannot be done this way.

I also think you can mimic the behaviour of the linked example by handing over already processed information: read the headers, store them, and then have getCurrentValue return a my.VcfCodec object instead of a Text object, so that your mapper receives a my.VcfCodec. Your mapper could look something like this:

public static class VcfMapper extends Mapper<LongWritable, my.VcfCodec, Text, IntWritable>{
    public void map(LongWritable key, my.VcfCodec value, Context context ) throws IOException, InterruptedException {
        // whatever you may want to do with the encoded data...
    }
}
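The idea of decoding inside the reader can be sketched without Hadoop. The classes below only mirror the RecordReader lifecycle (initialize, nextKeyValue, getCurrentValue); they are not Hadoop classes. Variant and VcfCodec are hypothetical stand-ins for my.Variant and my.VcfCodec, and handing out the decoded record rather than the codec itself is one reading of the suggestion above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VcfReaderSketch {
    /** Hypothetical stand-in for my.Variant: one decoded body line. */
    static class Variant {
        final String chrom;
        final int pos;
        Variant(String chrom, int pos) { this.chrom = chrom; this.pos = pos; }
    }

    /** Hypothetical stand-in for my.VcfCodec: built from the header, decodes body lines. */
    static class VcfCodec {
        private final int chromCol;
        private final int posCol;
        VcfCodec(List<String> headerLines) throws IOException {
            if (headerLines.isEmpty()) throw new IOException("Bad VCF header");
            String last = headerLines.get(headerLines.size() - 1);
            List<String> names = Arrays.asList(last.substring(1).split("\t"));
            chromCol = names.indexOf("CHROM");
            posCol = names.indexOf("POS");
            if (chromCol < 0 || posCol < 0) throw new IOException("Bad VCF header");
        }
        Variant decode(String line) {
            String[] f = line.split("\t");
            return new Variant(f[chromCol], Integer.parseInt(f[posCol]));
        }
    }

    /** Mirrors the RecordReader lifecycle; the header never leaves this class. */
    static class DecodingVcfReader {
        private final BufferedReader in;
        private VcfCodec codec;
        private Variant current;

        DecodingVcfReader(BufferedReader in) { this.in = in; }

        /** Consumes every '#' line up to "#CHROM" and builds the codec from them. */
        void initialize() throws IOException {
            List<String> headerLines = new ArrayList<>();
            String row;
            while ((row = in.readLine()) != null) {
                if (!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if (row.startsWith("#CHROM")) break;
            }
            codec = new VcfCodec(headerLines);
        }

        /** Reads one body line and decodes it; returns false at end of input. */
        boolean nextKeyValue() throws IOException {
            String row = in.readLine();
            if (row == null) return false;
            current = codec.decode(row);
            return true;
        }

        Variant getCurrentValue() { return current; }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative sample, not taken from a real file.
        String vcf = "##fileformat=VCFv4.1\n"
                   + "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
                   + "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\n";
        DecodingVcfReader reader = new DecodingVcfReader(new BufferedReader(new StringReader(vcf)));
        reader.initialize();
        while (reader.nextKeyValue()) {
            Variant v = reader.getCurrentValue();
            System.out.println(v.chrom + ":" + v.pos);
        }
    }
}
```

In a real job the same three methods would live in a RecordReader subclass, and the mapper's input value type would be the decoded class instead of Text.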

Your input format class is fine. As @frb said, the InputFormat class will not, by itself, differentiate between the metadata and the records.

One idea I can suggest:

  • Declare static variables in the mapper class for every metadata property of the VCF file, such as fileformat, date, source, etc.
  • In the VcfInputFormat class, read through the lines; if a line starts with '##', parse it and set the matching static variable of the mapper class according to the property name in that line.
  • If the line doesn't start with '##', simply pass the line to the mapper.
  • In the mapper class, parse the record contents and derive useful values with the help of the static variables that represent the metadata.
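The steps above can be sketched in plain Java as follows. StaticHeaderSketch, META, offer and map are hypothetical names, and a single map stands in for the individual static variables. Note the assumption this relies on: the reader and the mapper of one task run in the same JVM, so static state set by the reader is visible to the mapper of that task, but it is not shared with other tasks or with the client that submits the job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StaticHeaderSketch {
    /** Plays the role of the mapper's static variables: one entry per "##key=value" line. */
    static final Map<String, String> META = new LinkedHashMap<>();

    /** Reader side: stores "##" lines in META; returns true for lines to pass to the mapper. */
    static boolean offer(String line) {
        if (line.startsWith("##")) {
            int eq = line.indexOf('=');
            if (eq > 2) META.put(line.substring(2, eq), line.substring(eq + 1));
            return false; // metadata is kept here, not passed on
        }
        return true;      // the "#CHROM" line and the records are passed on
    }

    /** Mapper side: decodes a record with the help of the static metadata. */
    static String map(String line) {
        if (line.startsWith("#")) return null; // the column line carries no record
        String[] f = line.split("\t");
        return META.get("fileformat") + " " + f[0] + ":" + f[1];
    }

    public static void main(String[] args) {
        // Illustrative sample, not taken from a real file.
        String[] lines = {
            "##fileformat=VCFv4.1",
            "##source=example",
            "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
            "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14"
        };
        for (String line : lines) {
            if (offer(line)) {
                String out = map(line);
                if (out != null) System.out.println(out);
            }
        }
    }
}
```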

Hope this helps.
