
Processing a subset of a file in MapReduce

I need to process a huge file using MapReduce, and I was asked to provide a way for end users to select how many records they want to process.

The problem is that there does not seem to be any efficient way to process only a subset of the file without "mapping" the whole file (a 25 TB file).

Is there a way to stop mapping after a specific number of records and continue with the reduce part?

There is a very simple and elegant solution to this problem: override the run() method of the org.apache.hadoop.mapreduce.Mapper class and call map() only until you have processed as many records as you want, or only for the records you need.

See the following:

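import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();
    private int numberOfRecordsToProcess;

    // Read the limit that the driver class stored in the job configuration
    // after getting input from the user. The property name "records.to.process"
    // is only an example; use whatever key your driver sets.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        numberOfRecordsToProcess =
                context.getConfiguration().getInt("records.to.process", Integer.MAX_VALUE);
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do your map thing
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            int count = 0;
            // Stop pulling records once this mapper has processed enough of them.
            while (count < numberOfRecordsToProcess && context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
                count++;
            }
        } finally {
            // cleanup() belongs inside run(), after the loop.
            cleanup(context);
        }
    }
}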
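One thing to keep in mind with this approach: every map task runs its own copy of run(), so the counter is per mapper, not global. With many input splits, the job as a whole can process up to the limit multiplied by the number of mappers. The plain-Java sketch below (no Hadoop involved) simulates that behaviour. The class name PerMapperLimitSketch and the helper perMapperLimit are hypothetical, made up for illustration, and dividing the target evenly across splits is only one possible workaround, not something taken from the answer above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PerMapperLimitSketch {

    // Mimics the overridden run(): consume records until the limit is reached.
    static int runMapper(Iterator<String> records, int limit) {
        int processed = 0;
        while (processed < limit && records.hasNext()) {
            records.next(); // stand-in for map(key, value, context)
            processed++;
        }
        return processed;
    }

    // Each split gets its own mapper and therefore its own counter.
    static int totalProcessed(List<List<String>> splits, int limitPerMapper) {
        int total = 0;
        for (List<String> split : splits) {
            total += runMapper(split.iterator(), limitPerMapper);
        }
        return total;
    }

    // Hypothetical helper: spread a global target across mappers, rounding up.
    static int perMapperLimit(int globalLimit, int numSplits) {
        return (globalLimit + numSplits - 1) / numSplits;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e", "f"),
                Arrays.asList("g"));

        // The user asked for 2 records, but each mapper applies the limit separately.
        System.out.println("naive total: " + totalProcessed(splits, 2));

        // Dividing the target across the splits keeps the total closer to the request.
        int limit = perMapperLimit(2, splits.size());
        System.out.println("divided total: " + totalProcessed(splits, limit));
    }
}
```

Because of the rounding, and because splits can hold fewer records than their share, the divided version is still approximate. If an exact count matters, the job would need some extra coordination beyond a per-mapper counter.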

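See also the question "How to create output files with fixed number of lines in hadoop/map reduce?". You can use the information there to feed N lines to the mapper as input, and run only one mapper by calling the following from the main class:

setNumMapTasks(int)

Note that setNumMapTasks(int) belongs to JobConf in the older org.apache.hadoop.mapred API, and the framework treats the value only as a hint; the actual number of map tasks is determined by the input splits.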
