
MapFile as an input to a MapReduce job

I recently started using Hadoop, and I have a problem while using a MapFile as an input to a MapReduce job.

The following working code writes a simple MapFile called "TestMap" to HDFS, containing three keys of type Text and three values of type BytesWritable.

Here are the contents of TestMap:

$ hadoop fs  -text /user/hadoop/TestMap/data
11/01/20 11:17:58 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/01/20 11:17:58 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/01/20 11:17:58 INFO compress.CodecPool: Got brand-new decompressor
A    01
B    02
C    03

Here is the program that creates the TestMap MapFile:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;

public class CreateMap {

    public static void main(String[] args) throws IOException {

        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Text key = new Text();
        BytesWritable value = new BytesWritable();
        byte[] data = {1, 2, 3};
        String[] strs = {"A", "B", "C"};

        // Write the MapFile "TestMap" to HDFS
        MapFile.Writer writer =
            new MapFile.Writer(conf, hdfs, "TestMap", key.getClass(), value.getClass());
        try {
            for (int i = 0; i < 3; i++) {
                key.set(strs[i]);
                value.set(data, i, 1);      // one byte per key
                writer.append(key, value);  // keys must be appended in sorted order
                System.out.println(strs[i] + ":" + data[i] + " added.");
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        finally {
            IOUtils.closeStream(writer);
        }
    }
}

The simple MapReduce job that follows tries to increment the values of the MapFile by one:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.BytesWritable;


public class AddOne extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {

        public void map(Text key, BytesWritable value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {

            // Increment the first byte of the value and re-emit it as text
            byte[] data = value.getBytes();
            data[0] += 1;
            value.set(data, 0, 1);
            output.collect(key, new Text(value.toString()));
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {

            output.collect(key, values.next());
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        JobConf job = new JobConf(conf, AddOne.class);

        Path in = new Path("TestMap");
        Path out = new Path("output");
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("AddOne");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ":");


        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new AddOne(), args);

        System.exit(res);
    }
}

The runtime exception that I get is:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.BytesWritable
    at AddOne$MapClass.map(AddOne.java:32)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

I don't understand why Hadoop is trying to cast a LongWritable, since in my code I define the Mapper interface correctly (Mapper<Text, BytesWritable, Text, Text>).

Could somebody help me?

Thank you very much

Luca

Your problem comes from the fact that, despite what the name tells you, a MapFile is not a file.

A MapFile is actually a directory that consists of two files: there is a "data" file, which is a SequenceFile containing the keys and values you write into it, and an "index" file, which is a different SequenceFile containing a subsequence of the keys along with their offsets as LongWritables. This index is loaded into memory by MapFile.Reader so that, when you do random access, it can quickly binary-search for the offset in the data file that holds the data you want.
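For example, listing the path written by CreateMap should show the two component files, and the index can be inspected directly (paths assumed from the question above; output omitted):

$ hadoop fs -ls /user/hadoop/TestMap
$ hadoop fs -text /user/hadoop/TestMap/index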

You're using the old "org.apache.hadoop.mapred" version of SequenceFileInputFormat. It's not smart enough to look only at the data file when you point it at a MapFile as input; instead, it tries to use both the data file and the index file as regular input files. The data file will work correctly, because its classes agree with what you specify, but the index file will throw the ClassCastException, because the index file's values are all LongWritables.

You have two options: you can start using the "org.apache.hadoop.mapreduce" version of SequenceFileInputFormat (thus changing other parts of your code), which knows enough about MapFiles to look only at the data file; or you can explicitly give the data file as the input.
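For the second option, a minimal sketch (keeping the old mapred API used in AddOne above) is to point the input path at the data file inside the MapFile directory rather than at the directory itself:

        Path in = new Path("TestMap/data");   // the SequenceFile inside the MapFile directory
        FileInputFormat.setInputPaths(job, in);

With that change the old SequenceFileInputFormat never sees the index file, so the LongWritable values never reach your mapper.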

I resolved the same issue using KeyValueTextInputFormat.class.

I have described the whole approach at

http://sanketraut.blogspot.in/2012/06/hadoop-example-setting-up-hadoop-on.html
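For reference, a minimal sketch of that configuration with the old mapred API (this assumes the input is a plain text file whose keys and values are separated by ":", not the binary MapFile from the question):

        job.setInputFormat(KeyValueTextInputFormat.class);   // both keys and values arrive as Text
        job.set("key.value.separator.in.input.line", ":");   // the default separator is a tab

The mapper signature then becomes Mapper<Text, Text, ...> instead of Mapper<Text, BytesWritable, ...>.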

One approach might be to use a custom InputFormat with one record for the whole MapFile block, and to do the lookup by key inside map().
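The custom InputFormat itself is not shown in that answer; the lookup it would enable inside map() could look roughly like the sketch below (the "TestMap" path, the looked-up key "B", and the surrounding handling are assumptions for illustration; conf stands for the JobConf handed to the mapper's configure() method):

            // Open the MapFile once; MapFile.Reader loads the index into memory
            // and uses it to seek directly into the data file.
            MapFile.Reader reader = new MapFile.Reader(FileSystem.get(conf), "TestMap", conf);
            try {
                BytesWritable value = new BytesWritable();
                if (reader.get(new Text("B"), value) != null) {
                    // value now holds the byte stored under key "B"
                }
            } finally {
                reader.close();
            }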
