
Hadoop multiple inputs

I am using Hadoop MapReduce and I want to process two files. My first MapReduce iteration gives me a file with ID-number pairs like this:

A 30
D 20

My goal is to use the ID from that file to join with another file and produce output with a triple of ID, Name, Number, like this:

A ABC 30
D EFGH 20

But I am not sure whether using MapReduce is the best way to do this. Would it be better, for example, to use a file reader to read the second input file and look up the Name by ID? Or can I do it with MapReduce?
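By a "file reader" I mean something like loading the whole ID-to-Name file into memory in the mapper's setup() and doing the join on the map side. A rough sketch of that idea (the class name, the placeholder path and the column positions are just for illustration, not my real code):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> idToName = new HashMap<String, String>();
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the small "ID Name" file into memory (placeholder path).
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("path/to/id-name-file"))));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\\s+");
                idToName.put(parts[0], parts[1]);
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line like "A 30": look up the name for the ID and
        // emit key "A" with value "ABC 30".
        String[] parts = value.toString().split("\\s+");
        String name = idToName.get(parts[0]);
        if (name != null) {
            outKey.set(parts[0]);
            outValue.set(name + " " + parts[1]);
            context.write(outKey, outValue);
        }
    }
}

That only works if the lookup file fits in memory, though, so I would also like to know how to do it purely with MapReduce.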

If so, I'm trying to find out how. I tried a MultipleInputs solution:

MultipleInputs.addInputPath(job2, new Path(args[1]+"-tmp"),
    TextInputFormat.class, FlightsByCarrierMapper2.class);
MultipleInputs.addInputPath(job2, new Path("inputplanes"),
    TextInputFormat.class, FlightsModeMapper.class); 

But I can't think of any way to combine the two and get the output I want. What I have right now just gives me a list like this:

A ABC
A 30
B ABCD
C ABCDEF
D EFGH
D 20

After my last reduce I am getting this:

N125DL  767-332
N125DL  7   , 
N126AT  737-76N
N126AT  19  , 
N126DL  767-332
N126DL  1   , 
N127DL  767-332
N127DL  7   , 
N128DL  767-332
N128DL  3

I want this: N127DL 7 767-332. And also, I don't want the ones that do not combine.

And this is my reducer class:

public class FlightsByCarrierReducer2 extends Reducer<Text, Text, Text, Text> {

    String merge = "";

    protected void reduce(Text token, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        int i = 0;
        for (Text value : values) {
            if (i == 0) {
                merge = value.toString() + ",";
            } else {
                merge += value.toString();
            }
            i++;
        }

        context.write(token, new Text(merge));
    }
}

Update:

http://stat-computing.org/dataexpo/2009/the-data.html is the data set I'm using.

I'm working with TailNum and Cancelled (which is 1 or 0), trying to get the model name that corresponds to each TailNum. My file with the models has TailNum, Model, and other fields. My current output is:

N193JB ERJ 190-100 IGW

N194DN 767-332

N19503 EMB-135ER

N19554 EMB-145LR

N195DN 767-332

N195DN 2

First comes the key, then the model; the keys that have cancelled flights appear below the model.

And I would like a triple Key, Model, Number of Cancelled, because I want the number of cancellations per model.

You can join them by using the ID as the key in both mappers. You can write your map task something like this:

Text word1 = new Text();
Text word2 = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line so the ID comes out separately.
    // For "A 30":  word1 = A, word2 = 30
    // For "A ABC": word1 = A, word2 = ABC
    String[] parts = value.toString().split("\\s+");
    word1.set(parts[0]);
    word2.set(parts[1]);
    context.write(word1, word2);
}

I think you can reuse the same map task for both inputs. Then write a common reducer, where the Hadoop framework groups the data by key, so you will get the ID as the key. You can cache one of the values and then concatenate.

Text valEmit = new Text();
String merge = "";

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    int i = 0;
    for (Text value : values) {
        if (i == 0) {
            merge = value.toString() + ",";
        } else {
            merge += value.toString();
        }
        i++;
    }
    valEmit.set(merge);
    context.write(key, valEmit);
}

Finally, you can write your driver class:

public int run(String[] args) throws Exception {
 Configuration c=new Configuration();
 String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
 Path p1=new Path(files[0]);
 Path p2=new Path(files[1]);
 Path p3=new Path(files[2]);
 FileSystem fs = FileSystem.get(c);
 if(fs.exists(p3)){
  fs.delete(p3, true);
  }
 Job job = new Job(c,"Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

You can find the example HERE.

Hope this helps.


UPDATE

Input1

A 30
D 20

Input2

A ABC
D EFGH

Output

A ABC 30
D EFGH 20

Mapper.java

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author sreeveni
 *
 */
public class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
    Text keyEmit = new Text();
    Text valEmit = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String parts[] = line.split(" ");
        keyEmit.set(parts[0]);
        valEmit.set(parts[1]);
        context.write(keyEmit, valEmit);
    }
}

Reducer.java

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author sreeveni
 *
 */
public class ReducerJoin extends Reducer<Text, Text, Text, Text> {

    Text valEmit = new Text();
    String merge = "";

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String character = "";
        String number = "";
        for (Text value : values) {
            // ordering output
            String val = value.toString();
            char myChar = val.charAt(0);

            if (Character.isDigit(myChar)) {
                number = val;
            } else {
                character = val;
            }
        }
        merge = character + " " + number;
        valEmit.set(merge);
        context.write(key, valEmit);
    }

}
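If you also want to drop the keys that appear in only one of the inputs (you said you don't want the ones that do not combine), you could emit only when both parts were found. A small variation of the reduce() above, just as a sketch:

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String character = "";
    String number = "";
    for (Text value : values) {
        String val = value.toString();
        if (Character.isDigit(val.charAt(0))) {
            number = val;
        } else {
            character = val;
        }
    }
    // Emit only when the key was seen in both inputs.
    if (!character.isEmpty() && !number.isEmpty()) {
        valEmit.set(character + " " + number);
        context.write(key, valEmit);
    }
}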

Driver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author sreeveni
 *
 */
public class Driver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        // checking the arguments count

        if (args.length != 3) {
            System.err
                    .println("Usage : <inputlocation>  <inputlocation>  <outputlocation> ");
            System.exit(0);
        }
        int res = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(res);

    }

    @Override
    public int run(String[] args) throws Exception {
        // TODO Auto-generated method stub
        String source1 = args[0];
        String source2 = args[1];
        String dest = args[2];
        Configuration conf = new Configuration();
        conf.set("mapred.textoutputformat.separator", " "); // changing default
                                                            // delimiter to user
                                                            // input delimiter
        FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "Multiple Jobs");

        job.setJarByClass(Driver.class);
        Path p1 = new Path(source1);
        Path p2 = new Path(source2);
        Path out = new Path(dest);
        MultipleInputs.addInputPath(job, p1, TextInputFormat.class,
                Mapper1.class);
        MultipleInputs.addInputPath(job, p2, TextInputFormat.class,
                Mapper1.class);
        job.setReducerClass(ReducerJoin.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        /*
         * delete if exist
         */
        if (fs.exists(out))
            fs.delete(out, true);

        TextOutputFormat.setOutputPath(job, out);
        boolean success = job.waitForCompletion(true);

        return success ? 0 : 1;
    }

}

Your reducer has a map method, but it should have a reduce method that takes an Iterable collection of values which you then merge. Because you don't have a reduce() method, you get the default behavior, which is to just pass through all of the key/value pairs.
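For reference, the reduce method only gets called if its signature matches what the Reducer base class expects; adding @Override makes the compiler catch a mismatch. A minimal sketch, assuming Text keys and Text values as in the question's code:

public class FlightsByCarrierReducer2 extends Reducer<Text, Text, Text, Text> {

    private final Text valEmit = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all values for this key into one comma-separated string.
        StringBuilder merged = new StringBuilder();
        for (Text value : values) {
            if (merged.length() > 0) {
                merged.append(",");
            }
            merged.append(value.toString());
        }
        valEmit.set(merged.toString());
        context.write(key, valEmit);
    }
}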
