Hadoop multiple inputs wrongly grouped: two-way join exercise

Question

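I'm trying to learn a bit of Hadoop and have read a lot about how to do a natural join. I have two files containing keys and info, which I want to join and present as (a, b, c).

My problem is that the mappers are calling the reducer separately for each file. I was expecting to receive something like (10, [R1, S10, S22]), where 10 is the key; 1, 10 and 22 are values from different rows that have 10 as their key; and R and S are tags that let me identify which table each value came from.

Instead, my reducer receives (10, [S10, S22]), and only after it has finished with the whole S file do I get another key-value pair such as (10, [R1]). In other words, the values are grouped by key separately for each file, and the reducer is called once per file.

I'm not sure whether this behavior is correct, whether I have to configure the job differently, or whether I'm doing something wrong.

I'm also new to Java, so the code may look bad to you.

I'm avoiding the TextPair data type because I couldn't come up with that myself yet; I assume it would be another valid approach (in case you were wondering). Thanks.

I'm running Hadoop 2.4.1, and the code is based on the WordCount example.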

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class TwoWayJoin {

    public static class FirstMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            Text a = new Text();
            Text b = new Text();

            a.set(tokenizer.nextToken());
            b.set(tokenizer.nextToken());

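            // Tag the value with "R" so the reducer can tell which relation it came from.
            Text relation = new Text("R" + a.toString());

            output.collect(b, relation);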
        }
    }

    public static class SecondMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            Text b = new Text();
            Text c = new Text();

            b.set(tokenizer.nextToken());
            c.set(tokenizer.nextToken());

            Text relation = new Text("S"+c.toString());

            output.collect(b, relation);

        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

            ArrayList < Text > RelationS = new ArrayList < Text >() ;
            ArrayList < Text > RelationR = new ArrayList < Text >() ;

            while (values.hasNext()) {
                String relationValue = values.next().toString();
                if (relationValue.indexOf('R') >= 0){
                    RelationR.add(new Text(relationValue));
                } else {
                    RelationS.add(new Text(relationValue));
                }
            }

            for( Text r : RelationR ) {
                for (Text s : RelationS) {
                    output.collect(key, new Text(r + "," + key.toString() + "," + s));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultipleInputs.class);
        conf.setJobName("TwoWayJoin");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, FirstMap.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, SecondMap.class);

        Path output = new Path(args[2]); 

        FileOutputFormat.setOutputPath(conf, output);

        FileSystem.get(conf).delete(output, true);

        JobClient.runJob(conf);

    }
}

R.txt

(a  b(key))
2   46
1   10
0   24
31  50
11  2
5   31
12  36
9   46
10  34
6   31

S.txt

(b(key)  c)
45  32
45  45
46  10
36  15
45  21
45  28
45  9
45  49
45  18
46  21
45  45
2   11
46  15
45  33
45  6
45  20
31  28
45  32
45  26
46  35
45  36
50  49
45  13
46  3
46  8
31  45
46  18
46  21
45  26
24  15
46  31
46  47
10  24
46  12
46  36

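The job completes successfully, but the output is empty, because in every reduce call either the R array or the S array is empty.

If I simply collect the values one by one without any processing, all rows are mapped and emitted.

The expected output is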

key  "a,b,c"

Answer 1

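The problem is the combiner. Remember that a combiner applies the reduce function to the map output. Because each input file has its own mapper, the reduce function ends up being applied to your R and S relations separately, which is why you see R and S values in different reduce calls. Comment out

conf.setCombinerClass(Reduce.class);

and run the job again; the problem should go away. As a side note, a combiner only helps when aggregating the map output early gives the same result as applying the same function to the full input after the sort and shuffle.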
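The effect described above can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class `CombinerJoinDemo` and its methods are hypothetical names made up for illustration, and `joinReduce` only mirrors the question's `Reduce` logic (with the R/S tag stripped from the output). It assumes the combiner runs once per map task, which Hadoop does not guarantee. The values for key 10 come from the sample files (`1  10` in R.txt, `10  24` in S.txt).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerJoinDemo {

    // Mirrors the question's reduce logic: split the tagged values into R and S,
    // then emit one "a,b,c" row per (R, S) pair. The tag is stripped here.
    static List<String> joinReduce(String key, List<String> values) {
        List<String> r = new ArrayList<>();
        List<String> s = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("R")) {
                r.add(v.substring(1));
            } else {
                s.add(v.substring(1));
            }
        }
        List<String> out = new ArrayList<>();
        for (String a : r) {
            for (String c : s) {
                out.add(a + "," + key + "," + c);
            }
        }
        return out;
    }

    // Join function used as combiner: it runs on each map task's output first,
    // so it only ever sees one relation and emits nothing for the reducer.
    static List<String> runWithCombiner(String key, List<List<String>> perMapperValues) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> mapperValues : perMapperValues) {
            shuffled.addAll(joinReduce(key, mapperValues)); // combiner pass
        }
        return joinReduce(key, shuffled); // reducer pass on whatever survived
    }

    // No combiner: all values for the key reach a single reduce call.
    static List<String> runWithoutCombiner(String key, List<List<String>> perMapperValues) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> mapperValues : perMapperValues) {
            shuffled.addAll(mapperValues);
        }
        return joinReduce(key, shuffled);
    }

    public static void main(String[] args) {
        // One list per map task: the R mapper and the S mapper, both for key "10".
        List<List<String>> perMapper =
                Arrays.asList(Arrays.asList("R1"), Arrays.asList("S24"));
        System.out.println(runWithCombiner("10", perMapper));    // prints []
        System.out.println(runWithoutCombiner("10", perMapper)); // prints [1,10,24]
    }
}
```

A join is not the kind of aggregation that can safely be applied to partial input, unlike a sum or a count, so in the question's driver the `conf.setCombinerClass(Reduce.class)` line should be dropped while `conf.setReducerClass(Reduce.class)` stays.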

Solution 1
1 accepted, 2015-10-14 04:48:38
