Hadoop Map Reduce CustomSplit / CustomRecordReader

Question

I have a huge text file that I want to split so that each chunk contains 5 lines. I implemented my own GWASInputFormat and GWASRecordReader classes. In the code below (which I copied from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/), the initialize() method contains the following lines:

FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();

My question is: has the file already been split by the time initialize() is called in my GWASRecordReader class? I thought I was doing the split myself in GWASRecordReader. Please let me know whether my understanding is correct.

package com.test;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class GWASRecordReader extends RecordReader<LongWritable, Text> {

private final int NLINESTOPROCESS = 5;
private LineReader in;
private LongWritable key;
private Text value = new Text();
private long start = 0;
private long pos = 0;
private long end = 0;
private int maxLineLength;

public void close() throws IOException {
    if(in != null) {
        in.close();
    }
}

public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return key;
}

public Text getCurrentValue() throws IOException, InterruptedException {
    return value;
}

public float getProgress() throws IOException, InterruptedException {
    if(start == end) {
        return 0.0f;
    }
    else {
        return Math.min(1.0f, (pos - start)/(float) (end - start));
    }
}

public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    // The split is computed by the InputFormat before this method runs;
    // it only describes a byte range (path, start, length) of the file.
    FileSplit split = (FileSplit) genericSplit;
    final Path file = split.getPath();
    Configuration conf = context.getConfiguration();
    this.maxLineLength = conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
    FileSystem fs = file.getFileSystem(conf);
    start = split.getStart();
    end = start + split.getLength();
    System.out.println("---------------SPLIT LENGTH---------------------" + split.getLength());
    boolean skipFirstLine = false;
    FSDataInputStream filein = fs.open(file);

    // A split that does not start at byte 0 may begin mid-line. Back up one
    // byte so that a split starting exactly on a line boundary loses nothing.
    if (start != 0) {
        skipFirstLine = true;
        --start;
        filein.seek(start);
    }

    in = new LineReader(filein, conf);
    // Discard the partial first line; the reader of the previous split reads it in full.
    if (skipFirstLine) {
        start += in.readLine(new Text(), 0, (int) Math.min((long) Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
}

public boolean nextKeyValue() throws IOException, InterruptedException {
    if (key == null) {
        key = new LongWritable();
    }

    // The key is the byte offset of the first line in this group.
    key.set(pos);

    if (value == null) {
        value = new Text();
    }
    value.clear();
    final Text endline = new Text("\n");
    final Text v = new Text();
    int linesRead = 0;
    for (int i = 0; i < NLINESTOPROCESS; i++) {
        while (pos < end) {
            int newSize = in.readLine(v, maxLineLength, Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            if (newSize == 0) {
                break; // end of file: nothing was read, so append nothing
            }
            value.append(v.getBytes(), 0, v.getLength());
            value.append(endline.getBytes(), 0, endline.getLength());
            linesRead++;
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }
        }
    }

    // Return a final group with fewer than NLINESTOPROCESS lines instead of dropping it.
    if (linesRead == 0) {
        key = null;
        value = null;
        return false;
    } else {
        return true;
    }
}
}

Answer 1

Yes, the input file will already be split. It basically goes like this:

your input file(s) -> InputSplit -> RecordReader -> Mapper...

An InputSplit describes one chunk of the input (the InputFormat's getSplits() method creates them), and a RecordReader breaks each chunk into key/value pairs. Note that the InputSplit and RecordReader are determined by the InputFormat you use. For example, TextInputFormat uses FileSplit to break apart the input, then LineRecordReader, which processes each individual line with its byte position as the key and the line itself as the value. So in your GWASInputFormat, you'll need to check what kind of FileSplit you use to see what it is passing to GWASRecordReader.
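To make the order of operations concrete, here is a minimal plain-Java sketch of the two stages, with no Hadoop dependency. The class SplitPipelineSketch and its methods computeSplits and readLines are hypothetical names for illustration, not Hadoop APIs, and the boundary rule is only modeled on the one in the question's initialize() method. The point it tries to show: splits are plain byte ranges computed up front, and the reader then turns one range into whole lines by skipping a partial first line and finishing its last line past the end of the range.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitPipelineSketch {

    // Stage 1 (the input format's job): cut the file into byte ranges {start, length}.
    // No line boundaries are considered here.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    // Stage 2 (the record reader's job): turn one byte range into whole lines.
    static List<String> readLines(byte[] data, long start, long length) {
        int pos = (int) start;
        int end = (int) (start + length);
        if (start != 0) {
            // Back up one byte, then discard everything through the next newline.
            // If the range starts exactly on a line boundary, nothing is lost.
            pos--;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this range if it starts before 'end',
        // even when it finishes after 'end'.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a1\nb22\nc333\nd4444\ne5\n".getBytes(StandardCharsets.UTF_8);
        for (long[] s : computeSplits(data.length, 4)) {
            System.out.println("split@" + s[0] + " -> " + readLines(data, s[0], s[1]));
        }
    }
}
```

Under these assumptions every line is read by exactly one range, although some ranges yield no lines at all. That is also why a record reader that groups 5 lines can only group the lines of the range it was handed.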

I would suggest looking into NLineInputFormat, which "splits N lines of input as one split". It may be able to do exactly what you are trying to do yourself.
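As a rough illustration of what "N lines of input as one split" means, the plain-Java sketch below places split boundaries on line boundaries instead of at fixed byte sizes. NLineSplitSketch and splitsEveryNLines are hypothetical names, and this is not NLineInputFormat's actual implementation, only the idea behind it. In a real job you would select the format with job.setInputFormatClass(NLineInputFormat.class) and set N with NLineInputFormat.setNumLinesPerSplit(job, 5).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NLineSplitSketch {

    // Returns {startOffset, lengthInBytes} for each group of up to n lines.
    static List<long[]> splitsEveryNLines(byte[] data, int n) {
        List<long[]> splits = new ArrayList<>();
        long begin = 0;
        int linesInSplit = 0;
        for (int i = 0; i < data.length; i++) {
            // A line ends at a newline, or at the last byte if the file
            // has no trailing newline.
            boolean endOfLine = data[i] == '\n' || i == data.length - 1;
            if (endOfLine && ++linesInSplit == n) {
                splits.add(new long[] {begin, i + 1 - begin});
                begin = i + 1;
                linesInSplit = 0;
            }
        }
        if (linesInSplit > 0) {
            // Leftover lines form a final, shorter split.
            splits.add(new long[] {begin, data.length - begin});
        }
        return splits;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 12; i++) {
            sb.append("x\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        for (long[] s : splitsEveryNLines(data, 5)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Because each split already holds at most N whole lines, the record reader no longer has to deal with lines cut in half at a split boundary. The cost is that the file has to be scanned once to find the line boundaries.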

If you're trying to get 5 lines at a time as the value, with the line number of the first line as the key, you could do this with a customized NLineInputFormat and a custom LineRecordReader. I don't think you need to worry much about the input split, since the input format can split the input into those 5-line chunks. Your RecordReader would be very similar to LineRecordReader, but instead of using the byte position of the start of the chunk as the key, it would use the line number. You could essentially copy NLineInputFormat and LineRecordReader, then have the input format use your record reader that produces the line number; apart from that small change, the code would be almost identical.
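The reader side of that idea can be sketched in plain Java as follows. LineGroupReaderSketch is a hypothetical class that only mirrors the nextKeyValue()/getCurrentKey()/getCurrentValue() shape of a Hadoop RecordReader; it does not extend the real class, and it reads from a BufferedReader instead of a FileSplit. It emits up to 5 lines per record and uses the line number of the first line as the key.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineGroupReaderSketch {
    private final BufferedReader in;
    private final int linesPerRecord;
    private long nextLineNumber = 0; // zero-based number of the next unread line
    private long key;
    private String value;

    LineGroupReaderSketch(BufferedReader in, int linesPerRecord) {
        this.in = in;
        this.linesPerRecord = linesPerRecord;
    }

    // Reads up to linesPerRecord lines; returns false once the input is exhausted.
    boolean nextKeyValue() {
        StringBuilder group = new StringBuilder();
        int read = 0;
        try {
            String line;
            // The count is checked first, so no line is consumed once the group is full.
            while (read < linesPerRecord && (line = in.readLine()) != null) {
                group.append(line).append('\n');
                read++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (read == 0) {
            return false;
        }
        key = nextLineNumber; // line number of the first line in this group
        value = group.toString();
        nextLineNumber += read;
        return true;
    }

    long getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    public static void main(String[] args) {
        BufferedReader input = new BufferedReader(new StringReader("a\nb\nc\nd\ne\nf\ng\n"));
        LineGroupReaderSketch reader = new LineGroupReaderSketch(input, 5);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue().trim().replace('\n', ','));
        }
    }
}
```

Note that the counter starts at zero for whatever stream the reader is given. In a real job each reader sees only its own split, so the starting line number of that split would have to be supplied some other way, for example derived from the split's position when every split holds exactly N lines.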

7 2012-11-12 17:52:28
