
Map-Reduce Programming Error

My input is many text files. I want my map-reduce program to write all the file names and their associated sentences to a single output file: the mapper should emit just the file name (key) and the associated sentence (value), and the reducer should collect each key with all of its values and write the file name and its associated sentences to the output.

Here is the code of my mapper and reducer:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            FileSplit filesplit = (FileSplit) reporter.getInputSplit();
            String filename = filesplit.getPath().getName();
            output.collect(new Text(filename), value);
        }
    }
    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            StringBuilder builder = new StringBuilder();
            for (Text value : values) {
                String str = value.toString();
                builder.append(str);
            }
            String valueToWrite = builder.toString();
            output.collect(key, new Text(valueToWrite));
        }
        @Override
        public void reduce(Text arg0, Iterator<Text> arg1,
                OutputCollector<Text, Text> arg2, Reporter arg3)
                throws IOException {
        }
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setJarByClass(WordCount.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The output is as follows:

14/03/21 00:38:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library   
for your platform... using builtin-java classes where applicable
14/03/21 00:38:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the 
arguments. Applications should implement Tool for the same.
14/03/21 00:38:27 WARN mapred.JobClient: No job jar file set.  User classes may not  
be found. See JobConf(Class) or JobConf#setJar(String).
14/03/21 00:38:27 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/21 00:38:27 INFO mapred.FileInputFormat: Total input paths to process : 2
14/03/21 00:38:27 INFO mapred.JobClient: Running job: job_local_0001
14/03/21 00:38:27 INFO util.ProcessTree: setsid exited with exit code 0
14/03/21 00:38:27 INFO mapred.Task:  Using ResourceCalculatorPlugin : 
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4911b910
14/03/21 00:38:27 INFO mapred.MapTask: numReduceTasks: 1
14/03/21 00:38:27 INFO mapred.MapTask: io.sort.mb = 100
14/03/21 00:38:27 INFO mapred.MapTask: data buffer = 79691776/99614720
14/03/21 00:38:27 INFO mapred.MapTask: record buffer = 262144/327680
14/03/21 00:38:27 INFO mapred.MapTask: Starting flush of map output
14/03/21 00:38:27 INFO mapred.MapTask: Finished spill 0
14/03/21 00:38:27 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And  
is in the process of commiting
14/03/21 00:38:28 INFO mapred.JobClient:  map 0% reduce 0%
14/03/21 00:38:30 INFO mapred.LocalJobRunner:  
file:/root/Desktop/wordcount/sample.txt:0+5371
14/03/21 00:38:30 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/03/21 00:38:30 INFO mapred.Task:  Using ResourceCalculatorPlugin :  
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1f8166e5
14/03/21 00:38:30 INFO mapred.MapTask: numReduceTasks: 1
14/03/21 00:38:30 INFO mapred.MapTask: io.sort.mb = 100
14/03/21 00:38:30 INFO mapred.MapTask: data buffer = 79691776/99614720
14/03/21 00:38:30 INFO mapred.MapTask: record buffer = 262144/327680
14/03/21 00:38:30 INFO mapred.MapTask: Starting flush of map output
14/03/21 00:38:30 INFO mapred.MapTask: Finished spill 0
14/03/21 00:38:30 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And      
is in the process of commiting
14/03/21 00:38:31 INFO mapred.JobClient:  map 100% reduce 0%
14/03/21 00:38:33 INFO mapred.LocalJobRunner:  
file:/root/Desktop/wordcount/sample.txt~:0+587
14/03/21 00:38:33 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
14/03/21 00:38:33 INFO mapred.Task:  Using ResourceCalculatorPlugin : 
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3963b3e
14/03/21 00:38:33 INFO mapred.LocalJobRunner: 
14/03/21 00:38:33 INFO mapred.Merger: Merging 2 sorted segments
14/03/21 00:38:33 INFO mapred.Merger: Down to the last merge-pass, with 2 segments  
left of total size: 7549 bytes
14/03/21 00:38:33 INFO mapred.LocalJobRunner: 
14/03/21 00:38:33 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And  
is in the process of commiting
14/03/21 00:38:33 INFO mapred.LocalJobRunner: 
14/03/21 00:38:33 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to 
commit now
14/03/21 00:38:33 INFO mapred.FileOutputCommitter: Saved output of task  
'attempt_local_0001_r_000000_0' to file:/root/Desktop/wordcount/output
14/03/21 00:38:36 INFO mapred.LocalJobRunner: reduce > reduce
14/03/21 00:38:36 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
14/03/21 00:38:37 INFO mapred.JobClient:  map 100% reduce 100%
14/03/21 00:38:37 INFO mapred.JobClient: Job complete: job_local_0001
14/03/21 00:38:37 INFO mapred.JobClient: Counters: 21
14/03/21 00:38:37 INFO mapred.JobClient:   File Input Format Counters 
14/03/21 00:38:37 INFO mapred.JobClient:     Bytes Read=5958
14/03/21 00:38:37 INFO mapred.JobClient:   File Output Format Counters 
14/03/21 00:38:37 INFO mapred.JobClient:     Bytes Written=8
14/03/21 00:38:37 INFO mapred.JobClient:   FileSystemCounters
14/03/21 00:38:37 INFO mapred.JobClient:     FILE_BYTES_READ=26020
14/03/21 00:38:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=117337
14/03/21 00:38:37 INFO mapred.JobClient:   Map-Reduce Framework
14/03/21 00:38:37 INFO mapred.JobClient:     Map output materialized bytes=7557
14/03/21 00:38:37 INFO mapred.JobClient:     Map input records=122
14/03/21 00:38:37 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/03/21 00:38:37 INFO mapred.JobClient:     Spilled Records=244
14/03/21 00:38:37 INFO mapred.JobClient:     Map output bytes=7301
14/03/21 00:38:37 INFO mapred.JobClient:     Total committed heap usage  
(bytes)=954925056
14/03/21 00:38:37 INFO mapred.JobClient:     CPU time spent (ms)=0
14/03/21 00:38:37 INFO mapred.JobClient:     Map input bytes=5958
14/03/21 00:38:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=185
14/03/21 00:38:37 INFO mapred.JobClient:     Combine input records=0
14/03/21 00:38:37 INFO mapred.JobClient:     Reduce input records=0
14/03/21 00:38:37 INFO mapred.JobClient:     Reduce input groups=2
14/03/21 00:38:37 INFO mapred.JobClient:     Combine output records=0
14/03/21 00:38:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/03/21 00:38:37 INFO mapred.JobClient:     Reduce output records=0
14/03/21 00:38:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/03/21 00:38:37 INFO mapred.JobClient:     Map output records=122

When I run the above mapper and reducer with the input format set to KeyValueTextInputFormat.class, the job writes nothing to the output.

What should I change to achieve my goal?

KeyValueTextInputFormat is not the correct input format for your case. To use that format, each line of your input must contain a key/value pair separated by a user-specified delimiter (a tab by default). In your case, however, the input is a set of files, and you want the job's output to be "filename, content of file".
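To make the difference concrete, here is how KeyValueTextInputFormat would interpret a single input line under its default tab delimiter. This is a plain-Java sketch of the splitting rule, not the Hadoop class itself; the class and method names are illustrative only.

```java
public class KeyValueSplitDemo {
    // Mimics KeyValueTextInputFormat's default behavior: the text before
    // the first tab becomes the key, the rest of the line the value.
    static String[] split(String line) {
        int pos = line.indexOf('\t');
        if (pos == -1) {
            // No delimiter: the whole line is the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("sample.txt\tThis is a sentence.");
        System.out.println(kv[0]); // sample.txt
        System.out.println(kv[1]); // This is a sentence.
    }
}
```

Since your input files are plain sentences with no tab-delimited keys, every line would become a key with an empty value, which is not what you want.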

One way to achieve this is to use TextInputFormat as the input format. I have tested the code below and it works.

Get the file name and the content of the file in the map function:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    FileSplit filesplit = (FileSplit) context.getInputSplit();
    String filename = filesplit.getPath().getName();
    context.write(new Text(filename), new Text(value));
}

In the reduce function we build a string out of all the values, which will be the contents of the file:

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    StringBuilder builder = new StringBuilder();
    for (Text value : values) {
        builder.append(value.toString());
    }
    context.write(key, new Text(builder.toString()));
}
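Stripped of the Hadoop types, the concatenation step in the reduce function above amounts to the following self-contained sketch, where a List of Strings stands in for the Iterable of Text values:

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {
    // Concatenate all values for one key, as the reducer does with a
    // StringBuilder before emitting a single output record per file.
    static String concat(Iterable<String> values) {
        StringBuilder builder = new StringBuilder();
        for (String value : values) {
            builder.append(value);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("first sentence. ", "second sentence.");
        System.out.println(concat(sentences)); // first sentence. second sentence.
    }
}
```

Note that because values are concatenated directly, you may want to append a separator (such as a space or newline) between sentences.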

Finally, in the job driver class, set the input format to TextInputFormat and the number of reducers to 1:

job.setInputFormatClass(TextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(myMapper.class);
job.setReducerClass(myReducer.class);
job.setNumReduceTasks(1);
