
Java Hadoop: How can I create mappers that take files as input and output the number of lines in each file?

I'm new to Hadoop and I've only managed to run the wordCount example: http://hadoop.apache.org/common/docs/r0.18.2/mapred_tutorial.html

Suppose we have a folder containing 3 files. I want one mapper for each file, and each mapper will just count the number of lines and return the count to the reducer.

The reducer will then take as input the number of lines from each mapper, and output the total number of lines across all 3 files.

So if we have the following 3 files

input1.txt
input2.txt
input3.txt

and the mappers return:

mapper1 -> [input1.txt, 3]
mapper2 -> [input2.txt, 4]
mapper3 -> [input3.txt, 9]

the reducer will give an output of

3+4+9 = 16 

I have done this in a simple Java application, so now I would like to do it in Hadoop. I have just 1 computer and would like to try running it in a pseudo-distributed environment.

How can I achieve this? What steps should I take?

Should my code look like the Apache example, with two static classes, one for the mapper and one for the reducer? Or should I have 3 classes, one for each mapper?

If you can, please guide me through this. I have no idea how to do it, and I believe that if I manage to write some code that does this, then I will be able to write more complex applications in the future.

Thanks!

In addition to sa125's answer, you can hugely improve performance by not emitting a record for each input record, but rather just accumulating a counter in the mapper, and then, in the mapper's cleanup method, emitting the filename and count value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // running line count for this mapper's input split
    protected long lines = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // called once per input line; just count, emit nothing yet
        lines++;
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // emit a single (filename, line count) pair once the split is fully read
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().toString();

        context.write(new Text(filename), new LongWritable(lines));
    }
}
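Note that a reducer is still needed to add up what the mappers emit: a large file spans several splits, so the same filename can arrive with several partial counts. Here's a minimal sketch of such a reducer, assuming the Text/LongWritable types used by LineMapper above (Hadoop also ships LongSumReducer classes that do essentially this):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LineCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text filename, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        // sum the partial counts emitted for this filename by each mapper
        long total = 0;
        for (LongWritable count : counts) {
            total += count.get();
        }
        context.write(filename, new LongWritable(total));
    }
}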

I've noticed you use the docs from the 0.18 version. Here's a link to 1.0.2 (the latest at the time of writing).

First advice - use an IDE (Eclipse, IDEA, etc.). It'll really help with filling in the blanks.

In actual HDFS, you can't know where each piece of a file resides (it may be on different machines in the cluster). There's no guarantee that row X will even reside on the same disk as row Y, and there's also no guarantee that row X won't be split across different machines (HDFS distributes data in blocks, typically 64 MB each). This means that you can't assume the same mapper will handle an entire file. What you can make sure of is that each file is handled by the same reducer.

Since all values for a given key from the mappers go to a single reducer, the way I'd go about doing this is to use the filename as my output key in the mapper. In addition, the default input format for a mapper is TextInputFormat, which means each mapper receives one entire line at a time (terminated by LF or CR). You can then emit the filename and the number 1 (or whatever, it's irrelevant to the calculation) from your mapper. Then, in the reducer, you simply use a loop to count how many times the filename was received:

in the mapper's map function

public static class Map extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // with TextInputFormat the input key is the line's byte offset (LongWritable);
    // the generic InputSplit has to be cast to FileSplit to get the filename
    FileSplit split = (FileSplit) context.getInputSplit();
    String fileName = split.getPath().getName();

    // send the filename to the reducer, the value
    // has no meaning (I just put "1" to have something)
    context.write( new Text(fileName), new Text("1") );
  }

}

in the reducer's reduce function

public static class Reduce extends Reducer<Text, Text, Text, Text> {

  @Override
  public void reduce(Text fileName, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long rowCount = 0;

    // values gets one entry for each row, so the actual value doesn't matter
    for (Text val : values) {
      rowCount += 1;
    }

    // fileName is the Text key received (no need to create a new object)
    context.write( fileName, new Text( String.valueOf( rowCount ) ) );
  }

}

in the driver/main

You can pretty much use the same driver as the wordcount example - note that I used the new mapreduce API, so you'll need to adjust some things (Job instead of JobConf, etc.). This was really helpful when I read up on it.
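As a rough sketch (not from the original answer), a driver using the new API against the 1.0.2 docs might look like this, assuming the Map and Reduce classes above are nested in a LineCount class and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {
  // ... the Map and Reduce static classes from above go here ...

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "line count");  // Job.getInstance(conf, ...) in Hadoop 2.x
    job.setJarByClass(LineCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}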

Note that your MR output will be just each filename and the rowcount for it:

input1.txt    3
input2.txt    4
input3.txt    9

If you just want to count the TOTAL number of lines across all the files, simply emit the same key from all the mappers (not the filename). That way a single reducer will handle all the row counting:

// no need for filename
context.write( new Text("blah"), new Text("1") );

You can also chain a second job that processes the output of the per-file rowcount, or do other fancy stuff - that's up to you.
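For what it's worth, here's a sketch of that chaining idea (the intermediate path and job names are placeholders I made up): run the per-file count job first, then point a second job at its output directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedLineCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path intermediate = new Path("/tmp/linecount-intermediate"); // hypothetical scratch dir

    // first job: per-file row counts (set its Map/Reduce classes as shown earlier)
    Job countJob = new Job(conf, "per-file line count");
    countJob.setJarByClass(ChainedLineCount.class);
    FileInputFormat.addInputPath(countJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(countJob, intermediate);
    if (!countJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // second job: reads the first job's output and aggregates it further
    Job sumJob = new Job(conf, "total line count");
    sumJob.setJarByClass(ChainedLineCount.class);
    FileInputFormat.addInputPath(sumJob, intermediate);
    FileOutputFormat.setOutputPath(sumJob, new Path(args[1]));
    System.exit(sumJob.waitForCompletion(true) ? 0 : 1);
  }
}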

I left some boilerplate code out, but the basics are there. Be sure to double-check me, since I was typing most of this from memory.. :)

Hope this helps!
