
Hadoop Map Reduce CustomSplit/CustomRecordReader

I have a huge text file and I want to split it so that each chunk has 5 lines. I implemented my own GWASInputFormat and GWASRecordReader classes. However, my question is about the following code (which I copied from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/ ): inside the initialize() method I have the following lines

FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();

My question is: is the file already split by the time the initialize() method is called in my GWASRecordReader class? I thought I was doing the split in the GWASRecordReader class itself. Let me know if my thought process is right here.

package com.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class GWASRecordReader extends RecordReader<LongWritable, Text> {

private final int NLINESTOPROCESS = 5;
private LineReader in;
private LongWritable key;
private Text value = new Text();
private long start = 0;
private long pos = 0;
private long end = 0;
private int maxLineLength;

public void close() throws IOException {
    if(in != null) {
        in.close();
    }
}

public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return key;
}

public Text getCurrentValue() throws IOException, InterruptedException {
    return value;
}

public float getProgress() throws IOException, InterruptedException {
    if(start == end) {
        return 0.0f;
    }
    else {
        return Math.min(1.0f, (pos - start)/(float) (end - start));
    }
}

public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    final Path file = split.getPath();
    Configuration conf = context.getConfiguration();
    this.maxLineLength = conf.getInt("mapred.linerecordreader.maxlength",Integer.MAX_VALUE);
    FileSystem fs = file.getFileSystem(conf);
    start = split.getStart();
    end = start + split.getLength();
    System.out.println("---------------SPLIT LENGTH---------------------" + split.getLength());
    boolean skipFirstLine = false;
    FSDataInputStream filein = fs.open(split.getPath());

    // This split may start in the middle of a line: back up one byte and
    // skip that partial first line, since the previous split's reader reads it.
    if(start != 0) {
        skipFirstLine = true;
        --start;
        filein.seek(start);
    }

    in = new LineReader(filein, conf);
    if(skipFirstLine) {
        start += in.readLine(new Text(),0,(int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
}

public boolean nextKeyValue() throws IOException, InterruptedException {
    if (key == null) {
        key = new LongWritable();
    }

    key.set(pos);

    if (value == null) {
        value = new Text();
    }
    value.clear();
    final Text endline = new Text("\n");
    int newSize = 0;
    // Read up to NLINESTOPROCESS lines from this split into a single value,
    // re-appending the newline that readLine() strips from each line.
    for(int i=0; i<NLINESTOPROCESS;i++) {
        Text v = new Text();
        while( pos < end) {
            newSize = in.readLine(v ,maxLineLength, Math.max((int)Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            value.append(v.getBytes(), 0, v.getLength());
            value.append(endline.getBytes(),0,endline.getLength());
            if(newSize == 0) {
                break;
            }
            pos += newSize;
            if(newSize < maxLineLength) {
                break;
            }
        }
    }

    if(newSize == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}
}

Yes, the input file will already be split by the time initialize() is called. It basically goes like this:

your input file(s) -> InputSplit -> RecordReader -> Mapper...

Basically, the InputSplit breaks the input into chunks, and the RecordReader breaks those chunks into key/value pairs. Which InputSplit and RecordReader are used is determined by the InputFormat. For example, TextInputFormat uses FileSplit to break apart the input and then LineRecordReader to process each individual line, with the byte position as the key and the line itself as the value. So in your GWASInputFormat you'll need to look at what kind of FileSplit you use to see what it's passing to GWASRecordReader.
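
For reference, the wiring on the InputFormat side usually looks roughly like the sketch below. This is only a minimal sketch, assuming your GWASInputFormat extends FileInputFormat (you haven't shown it): the FileSplits come from the inherited getSplits(), and createRecordReader() only hands each already-computed split to your reader.

package com.test;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Minimal sketch (assumed structure): the framework computes the FileSplits
// via the inherited getSplits(); this class only supplies the record reader.
public class GWASInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // The split passed in here already exists; the reader's initialize()
        // just opens the file and seeks to split.getStart().
        return new GWASRecordReader();
    }
}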

I would suggest looking into NLineInputFormat, which "splits N lines of input as one split". It may be able to do exactly what you are trying to do yourself.
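
Switching a job over to NLineInputFormat is just driver configuration. Here is a minimal sketch, assuming the Hadoop 2.x mapreduce API; GWASDriver is a made-up name and the identity Mapper stands in for your real mapper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GWASDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gwas");
        job.setJarByClass(GWASDriver.class);

        // Every split now contains exactly 5 input lines. Note that the
        // mapper is still called once per line, with (byte offset, line).
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 5);

        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}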

If you're trying to get 5 lines at a time as the value, and the line number of the first line as the key, you could do this with a customized NLineInputFormat and a custom LineRecordReader. You don't need to worry as much about the input split, since the input format can cut the file into those 5-line chunks. Your RecordReader would be very similar to LineRecordReader, but instead of emitting the byte position of the start of the chunk, it would emit the line number. So you could essentially copy NLineInputFormat and LineRecordReader, make that one small change, and have your input format use the record reader that tracks line numbers.
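
To make the key change concrete, here is a minimal sketch of the record-reader half of that idea. The class name is made up, and instead of copy-pasting LineRecordReader it simply delegates to it and swaps the key; the counter is the line index within the current split, so a true global line number (or grouping 5 lines into one value, as in your GWASRecordReader) would still need extra work in your customized input format or reader.

package com.test;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch: reuse LineRecordReader for the actual reading, but emit a running
// line counter as the key instead of the byte offset.
public class LineNumberRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private long lineCount = 0;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;
        }
        key.set(lineCount++); // line index within this split, not a byte offset
        return true;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}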
