I'm playing with and learning Hadoop MapReduce.
I'm trying to map data from a VCF file ( http://en.wikipedia.org/wiki/Variant_Call_Format ): a VCF is a tab-delimited file starting with a (possibly large) header. This header is required to understand the semantics of the records in the body.
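For readers unfamiliar with the format, a minimal VCF looks roughly like this (meta lines start with `##`, the column-header line with `#CHROM`, then tab-delimited records; the record below is adapted from the VCF specification's example):

```
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
20	14370	rs6054257	G	A	29	PASS	DP=14
```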
I'd like to create a Mapper that would use that data. The header must be accessible from this Mapper in order to decode the lines.
From http://jayunit100.blogspot.fr/2013/07/hadoop-processing-headers-in-mappers.html , I've created this InputFormat with a custom Reader:
public static class VcfInputFormat extends FileInputFormat<LongWritable, Text>
    {
    /* the VCF header is stored here */
    private List<String> headerLines = new ArrayList<String>();

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new VcfRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    private class VcfRecordReader extends LineRecordReader
        {
        /* reads all lines starting with '#' */
        @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            super.initialize(genericSplit, context);
            // note: do NOT re-declare headerLines here; a local
            // "List<String> headerLines" would shadow the enclosing
            // field and leave it empty
            while (super.nextKeyValue())
            {
                String row = super.getCurrentValue().toString();
                if (!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if (row.startsWith("#CHROM")) break;
            }
        }
    }
}
Now, in the Mapper, is there a way to get a reference to VcfInputFormat.this.headerLines in order to decode the lines?
public static class VcfMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        my.VcfCodec codec = new my.VcfCodec(???????.headerLines);
        my.Variant variant = codec.decode(value.toString());
        //(....)
    }
}
I think your case is different from the example you linked to. In that case, the header is used inside the custom RecordReader class in order to produce a single "current value" (a line composed of all the filtered words), which is then passed to the mapper. However, in your case you want to use the header information outside the RecordReader, i.e. in your mapper, and that cannot be achieved.
I also think you can mimic the linked example's behaviour by providing already-processed information: read the headers, store them, and then, when producing the current value, have your getCurrentValue method return a my.VcfCodec object instead of a Text object, so the mapper receives the decoded data directly. Your mapper could be something like...
public static class VcfMapper extends Mapper<LongWritable, my.VcfCodec, Text, IntWritable> {
    public void map(LongWritable key, my.VcfCodec value, Context context) throws IOException, InterruptedException {
        // whatever you may want to do with the decoded data...
    }
}
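For completeness, here is a minimal, hypothetical sketch of what such a my.VcfCodec might look like. The class is the asker's own, so its constructor and the map-based decode result are assumptions; the only real constraint used here is that the last header line ("#CHROM...") names the record columns:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the asker's my.VcfCodec: decodes one VCF body
// line using the column names taken from the "#CHROM" header line.
class VcfCodec {
    private final String[] columns;

    VcfCodec(List<String> headerLines) {
        // the last header line ("#CHROM\tPOS\tID\t...") names the columns
        String chromLine = headerLines.get(headerLines.size() - 1);
        this.columns = chromLine.substring(1).split("\t");
    }

    /** Returns a column-name -> field map for one tab-delimited record. */
    Map<String, String> decode(String line) {
        String[] fields = line.split("\t");
        Map<String, String> variant = new HashMap<String, String>();
        for (int i = 0; i < columns.length && i < fields.length; i++) {
            variant.put(columns[i], fields[i]);
        }
        return variant;
    }
}
```

With this in place, the RecordReader can build the codec once in initialize() and apply it in getCurrentValue(), so the mapper never needs to see the header itself.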
Your input format class is fine; as @frb said, the InputFormat class will not be able to differentiate between the metadata and the records for you.
One idea that I can suggest: in the record reader, check each line. If it starts with '##', parse the line and set the value on a static variable of the mapper class, according to the property name in the current line. If it does not start with '##', simply pass the line to the mapper.
Hope this helps.
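A minimal sketch of that idea, assuming the "##" meta lines follow the usual key=value shape (the class and method names here are invented for illustration; in a real job a static field is only visible within one JVM/task, so this works because the file is not splittable):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: lines starting with "##" are key=value meta lines
// ("##fileformat=VCFv4.1"); collect them into a map that the mapper
// could read from a static field, and pass every other line through.
class VcfHeaderParser {
    static final Map<String, String> META = new HashMap<String, String>();

    /** Returns true if the line was a "##" meta line and was consumed. */
    static boolean consumeMetaLine(String line) {
        if (!line.startsWith("##")) {
            return false; // a record (or the "#CHROM" line): pass to mapper
        }
        String body = line.substring(2);
        int eq = body.indexOf('=');
        if (eq > 0) {
            META.put(body.substring(0, eq), body.substring(eq + 1));
        }
        return true;
    }
}
```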