Hadoop Map Reduce program for hashing

I have written a Map Reduce program in Hadoop that hashes all the records of a file, appends the hashed value as an additional attribute to each record, and then writes the output to the Hadoop file system. This is the code I have written:

public class HashByMapReduce
{
public static class LineMapper extends Mapper<Text, Text, Text, Text>
{
    private Text word = new Text();

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException
    {
        key.set("single");
        String line = value.toString();
        word.set(line);
        context.write(key, word);

    }
}
public static class LineReducer extends Reducer<Text,Text,Text,Text>
{
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        String translations = "";
        for (Text val : values)
        {
            translations = val.toString()+","+String.valueOf(hash64(val.toString())); //Point of Error

            result.set(translations);
            context.write(key, result);
        }
    }
}
public static void main(String[] args) throws Exception
{
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Hashing");
    job.setJarByClass(HashByMapReduce.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

I have written this code with the logic that each line is read by the map method, which assigns all values to a single key that is then passed to the same reducer; the reducer then passes each value to the hash64() function.

But I see it is passing a null (empty) value to the hash function, and I am unable to figure out why. Thanks in advance.

The cause of the problem is most probably the use of KeyValueTextInputFormat. From the Yahoo tutorial:

  InputFormat           Description                       Key                        Value

  TextInputFormat       Default format; reads             The byte offset            The line contents
                        lines of text files               of the line

  KeyValueInputFormat   Parses lines into                 Everything up to the       The remainder of
                        key, val pairs                    first tab character        the line

It splits your input lines at the tab character, and I suppose there is no tab in your lines. As a result, the key in the LineMapper is the whole line, while nothing is passed as the value (not sure whether it is null or empty); for example, the line hello world (with no tab) yields the key hello world and an empty value.

From your code, I think you should instead use the TextInputFormat class as your input format, which produces the line offset as the key and the complete line as the value. This should solve your problem.
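
As a side note, KeyValueTextInputFormat can also handle key/value inputs delimited by something other than a tab, since its separator is configurable. A minimal sketch, assuming the Hadoop 2.x property name mapreduce.input.keyvaluelinerecordreader.key.value.separator (older releases used key.value.separator.in.input.line):

    Configuration conf = new Configuration();
    // Assumed property name; split lines on ',' instead of the default tab
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = new Job(conf, "Hashing");
    job.setInputFormatClass(KeyValueTextInputFormat.class);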

EDIT: I ran your code with the following changes, and it seems to work fine:

  1. Changed the input format to TextInputFormat and changed the declaration of the Mapper accordingly.
  2. Added the proper setMapOutputKeyClass & setMapOutputValueClass calls to the job. These are not mandatory, but omitting them often causes problems at runtime.
  3. Removed your key.set("single") and added a private outKey to the Mapper.
  4. Since you provided no details of the hash64 method, I used String.toUpperCase for testing.

If the issue persists, then I'm sure your hash method does not handle null (or empty) input well; a defensive sketch follows.
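
For reference, here is a minimal, purely illustrative sketch of what such a hash64 could look like (a 64-bit FNV-1a hash that returns 0 for null or empty input; not your actual implementation):

    // Illustrative hash64 only: 64-bit FNV-1a over the string's UTF-8 bytes.
    // Returns 0 for null or empty input instead of throwing a NullPointerException.
    public static long hash64(String s) {
        if (s == null || s.isEmpty()) {
            return 0L;
        }
        long hash = 0xcbf29ce484222325L;   // FNV-1a 64-bit offset basis
        for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);           // mix in each byte
            hash *= 0x100000001b3L;        // multiply by the FNV-1a 64-bit prime
        }
        return hash;
    }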

Full code:

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class HashByMapReduce {
 public static class LineMapper extends
        Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text outKey = new Text("single");

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        word.set(line);
        context.write(outKey, word);
    }
}

public static class LineReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String translations = "";
        for (Text val : values) {
            translations = val.toString() + ","
                    + val.toString().toUpperCase(); // Point of Error

            result.set(translations);
            context.write(key, result);
        }
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Hashing");
    job.setJarByClass(HashByMapReduce.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

}
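
With these changes, an input file containing the lines foo and bar should produce reducer output along the lines of single foo,FOO and single bar,BAR (TextOutputFormat separates the key and value with a tab by default). Also note that funnelling every record under the single key "single" sends all lines through one reduce call; that is fine for testing, but it serializes the reduce phase on large inputs.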
