
Reading a record broken into two lines because of a \n in MapReduce

I am trying to write a custom record reader that reads a record with a defined number of fields, even when the record spans two lines.

For example:

1,2,3,4 (the trailing "," may or may not be present)
,5,6,7,8

My requirement is to read the record and push it into the mapper as a single record, like {1,2,3,4,5,6,7,8}. Please give some inputs.
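The merging logic being asked for can be sketched in plain Java. This is only an illustration, not the actual reader: it assumes comma-separated fields and hard-codes the required field count as 8 (a real RecordReader would make both configurable):

```java
import java.util.StringTokenizer;

public class RecordMerger {
    static final int REQ_FIELDS = 8;  // illustrative value; configurable in a real reader

    // Keep appending physical lines until the logical record has enough fields.
    static String merge(String[] lines) {
        StringBuilder record = new StringBuilder();
        int fields = 0;
        for (String line : lines) {
            if (fields >= REQ_FIELDS) break;
            record.append(line);
            // StringTokenizer ignores empty tokens, so a leading "," on a
            // continuation line does not inflate the count.
            fields += new StringTokenizer(line, ",").countTokens();
        }
        return record.toString();
    }

    public static void main(String[] args) {
        // The two physical lines from the example merge into one record.
        System.out.println(merge(new String[]{"1,2,3,4", ",5,6,7,8"}));
        // prints 1,2,3,4,5,6,7,8
    }
}
```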

UPDATE:

public boolean nextKeyValue() throws IOException, InterruptedException {
    if(key == null) {
        key = new LongWritable();
    }

    //Current offset is the key
    key.set(pos); 

    if(value == null) {
        value = new Text();
    }

    int newSize = 0;
    int numFields = 0;
    Text temp = new Text();
    boolean firstRead = true;

    while(numFields < reqFields) {
        while(pos < end) {
            //Read up to the '\n' character and store it in 'temp'
            newSize = in.readLine(  temp, 
                                    maxLineLength, 
                                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), 
                                             maxLineLength));

            //If 0 bytes were read, then we are at the end of the split
            if(newSize == 0) {
                break;
            }

            //Otherwise update 'pos' with the number of bytes read
            pos += newSize;

            //If the line is not too long, check number of fields
            if(newSize < maxLineLength) {
                break;
            }

            //Line too long, try again
            LOG.info("Skipped line of size " + newSize + " at pos " + 
                        (pos - newSize));
        }

        //Exit, since we're at the end of split
        if(newSize == 0) {
            break;
        }
        else {
            String record = temp.toString();
            StringTokenizer fields = new StringTokenizer(record,"|");

            numFields += fields.countTokens();

            //Reset 'value' if this is the first append
            if(firstRead) {
                value = new Text();
                firstRead = false;
            }

            //Append this line (partial or final chunk of the record) to 'value'
            value.append(temp.getBytes(), 0, temp.getLength());
        }
    }

    if(newSize == 0) {
        key = null;
        value = null;
        return false;
    }
    else {
        return true;
    }
}


This is the nextKeyValue method I am trying to get working, but the mapper is still not receiving proper values. reqFields is 4.

Look at how TextInputFormat is implemented. Look at its superclass, FileInputFormat, as well. You must subclass either TextInputFormat or FileInputFormat and implement your own record handling.

The thing to be aware of when implementing any kind of file input format is this:

The framework will split the file and give you the start offset and byte length of the piece of the file you have to read. It may very well happen that the split falls right in the middle of a record. That is why your reader must skip the bytes of the partial record at the beginning of the split, and must also read past the last byte of the split to finish the last record if that record is not fully contained in the split.
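This convention can be simulated without Hadoop. The sketch below is hedged: Hadoop's LineRecordReader uses a slightly different boundary rule, but the idea is the same, and it shows why every record is produced exactly once even when a split cuts one in half:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSimulation {
    // Return the records whose *start* falls inside [start, end) of 'data',
    // using '\n' as the record delimiter.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // Every split except the first skips the partial record at its head;
        // that record belongs to the previous split's reader.
        if (start != 0) {
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int recStart = pos;
            // Reading may run past 'end' to finish the straddling record.
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++;  // step over the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes();
        // A boundary at byte 5 cuts "bbbb" in half, yet each record
        // is produced by exactly one split.
        System.out.println(readSplit(data, 0, 5));   // [aaa, bbbb]
        System.out.println(readSplit(data, 5, 12));  // [cc]
    }
}
```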

For example, TextInputFormat treats \n characters as record delimiters, so when it gets a split it skips the bytes up to the first \n character and reads past the end of the split until the next \n character.

As for the code example:

You need to ask yourself the following question: if you open the file, seek to a random position, and start reading forward, how do you detect the start of a record? I don't see anything in your code that deals with that, and without it you cannot write a good input format, because you don't know where the record boundaries are.

It is still possible to make the input format read the whole file end to end by making the isSplitable(JobContext, Path) method return false. The whole file is then read by a single map task, which reduces parallelism.
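A minimal sketch of that approach against the new mapreduce API; the class name and the reader it returns are placeholders, not part of the question's code:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiLineInputFormat extends FileInputFormat<LongWritable, Text> {

    // Returning false forces one split per file: no record can ever straddle
    // a split boundary, at the cost of parallelism within a file.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MultiLineRecordReader(); // your custom reader (hypothetical name)
    }
}
```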

Your inner while loop seems problematic, since it checks for lines that are too long and skips them. Given that your records span multiple lines, it can happen that while reading you merge part of one record with part of another.

The string had to be tokenized using StringTokenizer and not split. The code has been updated with the new implementation.
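The difference comes down to empty tokens: String.split keeps leading empty strings produced by a leading delimiter, while StringTokenizer skips them, which changes the field count. A small plain-Java check using the continuation line from the example:

```java
import java.util.StringTokenizer;

public class TokenCount {
    public static void main(String[] args) {
        String secondLine = ",5,6,7,8";  // continuation line from the question

        // StringTokenizer skips the empty token before the leading ','.
        int tokCount = new StringTokenizer(secondLine, ",").countTokens();

        // String.split keeps the leading "" element (only trailing empties
        // are dropped by the default split).
        int splitCount = secondLine.split(",").length;

        System.out.println(tokCount);    // prints 4
        System.out.println(splitCount);  // prints 5
    }
}
```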
