Hadoop multiple inputs

Question

I am using Hadoop MapReduce and I want to process two files. My first Map/Reduce iteration gives me a file of IDs and numbers, like this:

A 30
D 20

My goal is to use the IDs in that file to join it with another file and output a triple: ID, Number, Name, like this:

A ABC 30
D EFGH 20

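But I am not sure whether MapReduce is the best way to do this. For example, would it be better to use a file reader to read the second input file and look up the name by ID? Or can I do it with MapReduce?

If so, I am trying to figure out how. I tried a MultipleInputs solution: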

MultipleInputs.addInputPath(job2, new Path(args[1]+"-tmp"),
    TextInputFormat.class, FlightsByCarrierMapper2.class);
MultipleInputs.addInputPath(job2, new Path("inputplanes"),
    TextInputFormat.class, FlightsModeMapper.class);

But I cannot think of any way to combine the two and get the output I want. What I have now gives me a list like this:

A ABC
A 30
B ABCD
C ABCDEF
D EFGH
D 20

After my final reduce I get this:

N125DL  767-332
N125DL  7   , 
N126AT  737-76N
N126AT  19  , 
N126DL  767-332
N126DL  1   , 
N127DL  767-332
N127DL  7   , 
N128DL  767-332
N128DL  3

I want this: N127DL 7 767-332. Also, I do not want the records that have no match.

This is my reducer class:

public class FlightsByCarrierReducer2 extends Reducer {

String merge = "";
protected void reduce(Text token, Iterable<Text> values, Context context) 
                            throws IOException, InterruptedException {

    int i = 0;  
    for(Text value:values)
    {
        if(i == 0){
            merge = value.toString()+",";
        }
        else{
            merge += value.toString();
        }
        i++;
    }

        context.write(token, new Text(merge));

}

}

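Update:

This is the example I am using: http://stat-computing.org/dataexpo/2009/the-data.html

I am trying to take TailNum and Canceled (which is 1 or 0) and get the model name corresponding to the TailNum. My model file has TailNum, Model, and other fields. My current output is: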

N193JB ERJ 190-100 IGW

N194DN 767-332

N19503 EMB-135ER

N19554 EMB-145LR

N195DN 767-332

N195DN 2

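The first column is the key and the second is the model. For keys that have canceled flights, the count appears on a separate line below the model.

I want a triple of key, model, and number of cancellations, because I want the number of cancellations per model.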

Answer 1

You can join them by using the ID as the key in both mappers. You can write your map task like this:

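public void map(LongWritable k, Text value, Context context) throws IOException, InterruptedException
{
    // Get the line and split it to separate the ID from the rest:
    //   "A 30"  -> word1 = A, word2 = 30
    //   "A ABC" -> word1 = A, word2 = ABC
    String[] parts = value.toString().split(" ", 2);
    if (parts.length < 2) {
        return; // skip lines that have no value after the ID
    }
    Text word1 = new Text(parts[0]);
    Text word2 = new Text(parts[1]);
    context.write(word1, word2);
}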
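The splitting step can also be tried outside Hadoop. The following is a minimal plain-Java sketch, assuming each line has the form `ID<space>rest`; the names `KeySplit` and `splitKeyValue` are illustrative and do not come from the original answer.

```java
class KeySplit {
    // Splits "A 30" into {"A", "30"}. Returns null for lines that do not
    // contain both an ID and a value, so the caller can skip them.
    static String[] splitKeyValue(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            return null;
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] first = splitKeyValue("A 30");
        String[] second = splitKeyValue("A ABC");
        // Both lines share the key "A", which is what makes the join possible.
        System.out.println(first[0] + " -> " + first[1]);
        System.out.println(second[0] + " -> " + second[1]);
    }
}
```

Limiting the split to two parts keeps a multi-word value such as `ERJ 190-100 IGW` together instead of truncating it at the first space.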

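I think you can reuse the same map task for both inputs. Then write a common reducer: the Hadoop framework groups the data by key, so you will receive the ID as the key. You can then cache one of the values and concatenate it with the other.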

Text valEmit = new Text();
String merge = "";
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException
{
    int i =0;
    for(Text value:values)
    {
        if(i == 0){
            merge = value.toString()+",";
        }
        else{
            merge += value.toString();
        }
        i++;
    }
    valEmit.set(merge);
    context.write(key, valEmit);
}
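To make the grouping step concrete, here is a minimal plain-Java simulation of what the reducer above receives and produces. It is only a sketch of the grouping behavior, not Hadoop's shuffle; `ShuffleSketch`, `groupByKey`, and `merge` are hypothetical names. Note that Hadoop does not guarantee the order of the values within a key, so which value arrives first should not be relied on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Groups (key, value) pairs by key, sorted by key, similar to what
    // a reducer sees after the shuffle phase.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Same merge rule as the reducer above: first value, a comma, then the rest.
    static String merge(List<String> values) {
        StringBuilder merged = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            merged.append(values.get(i));
            if (i == 0) {
                merged.append(",");
            }
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"A", "30"});   // from input 1
        mapOutput.add(new String[] {"A", "ABC"});  // from input 2
        mapOutput.add(new String[] {"B", "ABCD"}); // present in input 2 only
        for (Map.Entry<String, List<String>> e : groupByKey(mapOutput).entrySet()) {
            System.out.println(e.getKey() + " " + merge(e.getValue()));
        }
    }
}
```

With this merge rule, a key that has only one value still gets a trailing comma, so unmatched keys remain visible in the output unless the reducer filters them.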

Finally, you can write the driver class:

public int run(String[] args) throws Exception {
 Configuration c=new Configuration();
 String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
 Path p1=new Path(files[0]);
 Path p2=new Path(files[1]);
 Path p3=new Path(files[2]);
 FileSystem fs = FileSystem.get(c);
 if(fs.exists(p3)){
  fs.delete(p3, true);
  }
 Job job = new Job(c,"Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

You can find this example here.

Hope this helps.

UPDATE

Input 1

A 30
D 20

Input 2

A ABC
D EFGH

Output

A ABC 30
D EFGH 20

Mapper.java

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author sreeveni
 *
 */
public class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
    Text keyEmit = new Text();
    Text valEmit = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String parts[] = line.split(" ");
        keyEmit.set(parts[0]);
        valEmit.set(parts[1]);
        context.write(keyEmit, valEmit);
    }
}

Reducer.java

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author sreeveni
 *
 */
public class ReducerJoin extends Reducer<Text, Text, Text, Text> {

    Text valEmit = new Text();
    String merge = "";

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String character = "";
        String number = "";
        for (Text value : values) {
            // ordering output
            String val = value.toString();
            char myChar = val.charAt(0);

            if (Character.isDigit(myChar)) {
                number = val;
            } else {
                character = val;
            }
        }
        merge = character + " " + number;
        valEmit.set(merge);
        context.write(key, valEmit);
    }

}
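The `Character.isDigit` check works for the sample data, but the models in the question (for example `767-332`) also begin with a digit, and a key that appears in only one input is still written. One possible variation, sketched here in plain Java, is to have each mapper prefix its value with a source tag and to emit a record only when both sides are present. The tags `M:` and `C:` and the names `TaggedJoin` and `joinValues` are assumptions for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

class TaggedJoin {
    static final String MODEL_TAG = "M:"; // hypothetical tag added by the model-file mapper
    static final String COUNT_TAG = "C:"; // hypothetical tag added by the counts mapper

    // Returns "model count" for one key, or null when either side is
    // missing, so that unmatched keys can be dropped (an inner join).
    static String joinValues(Iterable<String> taggedValues) {
        String model = null;
        String count = null;
        for (String value : taggedValues) {
            if (value.startsWith(MODEL_TAG)) {
                model = value.substring(MODEL_TAG.length());
            } else if (value.startsWith(COUNT_TAG)) {
                count = value.substring(COUNT_TAG.length());
            }
        }
        return (model == null || count == null) ? null : model + " " + count;
    }

    public static void main(String[] args) {
        List<String> both = Arrays.asList("C:7", "M:767-332");
        List<String> modelOnly = Arrays.asList("M:EMB-135ER");
        System.out.println("N127DL " + joinValues(both));
        System.out.println("unmatched key yields: " + joinValues(modelOnly));
    }
}
```

In a Hadoop reducer this would correspond to calling `context.write` only when the joined value is non-null. Because the tag identifies the source, the result does not depend on the order in which the values arrive.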

Driver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author sreeveni
 *
 */
public class Driver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // check the argument count
        if (args.length != 3) {
            System.err
                    .println("Usage : <inputlocation>  <inputlocation>  <outputlocation> ");
            System.exit(1); // non-zero exit code signals the usage error
        }
        int res = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(res);

    }

    @Override
    public int run(String[] args) throws Exception {
        String source1 = args[0];
        String source2 = args[1];
        String dest = args[2];
        // Reuse the configuration that ToolRunner passed in, so generic options are honored
        Configuration conf = getConf();
        // Separate key and value with a space instead of the default tab
        conf.set("mapred.textoutputformat.separator", " ");
        FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "Multiple Jobs");

        job.setJarByClass(Driver.class);
        Path p1 = new Path(source1);
        Path p2 = new Path(source2);
        Path out = new Path(dest);
        MultipleInputs.addInputPath(job, p1, TextInputFormat.class,
                Mapper1.class);
        MultipleInputs.addInputPath(job, p2, TextInputFormat.class,
                Mapper1.class);
        job.setReducerClass(ReducerJoin.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        /*
         * delete if exist
         */
        if (fs.exists(out))
            fs.delete(out, true);

        TextOutputFormat.setOutputPath(job, out);
        boolean success = job.waitForCompletion(true);

        return success ? 0 : 1;
    }

}

Answer 2

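Your reducer has a map method, but it should have a reduce method that takes an Iterable collection of values and merges them. Because you have no reduce() method, you get the default behavior, which simply passes every key/value pair through unchanged.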
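The pass-through behavior can be reproduced without Hadoop. In the sketch below, `BaseReducer` is a hypothetical stand-in for a framework base class whose default `reduce` emits every value unchanged; it is not Hadoop's `Reducer` class. A subclass that defines a method under a different name never replaces the default, and annotating the intended method with `@Override` turns that mistake into a compile-time error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverrideSketch {
    // Hypothetical stand-in for a framework base class with an identity default.
    static class BaseReducer {
        List<String> reduce(String key, List<String> values) {
            List<String> out = new ArrayList<>();
            for (String value : values) {
                out.add(key + " " + value); // pass each pair through unchanged
            }
            return out;
        }
    }

    // Defines map() instead of reduce(), so the default reduce() still runs.
    static class WrongReducer extends BaseReducer {
        List<String> map(String key, List<String> values) {
            return Arrays.asList(key + " " + String.join(" ", values));
        }
    }

    // Overrides reduce(); @Override makes the compiler verify the signature.
    static class JoiningReducer extends BaseReducer {
        @Override
        List<String> reduce(String key, List<String> values) {
            return Arrays.asList(key + " " + String.join(" ", values));
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("767-332", "7");
        // The framework only ever calls reduce(), never the subclass's map().
        System.out.println(new WrongReducer().reduce("N127DL", values));
        System.out.println(new JoiningReducer().reduce("N127DL", values));
    }
}
```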

Answer 1
2 accepted 2014-12-08 04:58:08

Answer 2
0 2014-12-11 14:49:22
