Hadoop MapReduce whole-file input format

Question

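I am trying to use Hadoop MapReduce, but instead of mapping one line at a time in my Mapper, I would like to map a whole file at once.

So I found these two classes ( https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3 ) that are supposed to help me do this.

I get a compilation error that says:

The method setInputFormat(Class<? extends InputFormat>) in the type JobConf is not applicable for the arguments (Class<WholeFileInputFormat>) Driver.java /ex2/src line 33 Java Problem

I changed my Driver class as follows: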

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

import forma.WholeFileInputFormat;

/*
 * Driver
 * The Driver class is responsible for creating the job and committing it.
 */
public class Driver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Driver.class);
        conf.setJobName("Get minimum for each month");

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Previously it was:
        // conf.setInputFormat(TextInputFormat.class);
        // and it was changed to:
        conf.setInputFormat(WholeFileInputFormat.class);

        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf,new Path("input"));
        FileOutputFormat.setOutputPath(conf,new Path("output"));

        System.out.println("Starting Job...");
        JobClient.runJob(conf);
        System.out.println("Job Done!");
    }

}

What am I doing wrong?

Answer 1

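Make sure your WholeFileInputFormat class has the correct imports. You are using the old MapReduce API in your job driver, and JobConf.setInputFormat only accepts classes that implement org.apache.hadoop.mapred.InputFormat. I think you imported the new-API FileInputFormat in your WholeFileInputFormat class. If I am right, you should import org.apache.hadoop.mapred.FileInputFormat in your WholeFileInputFormat class instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat.

Hope this helps.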
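
The mismatch described above can be reproduced without Hadoop. The sketch below is only an illustration: OldApi and NewApi are hypothetical stand-ins for the org.apache.hadoop.mapred and org.apache.hadoop.mapreduce packages, and setInputFormat merely copies the shape of the JobConf parameter. None of it is Hadoop code.

```java
// Stand-in types only: OldApi mimics org.apache.hadoop.mapred and
// NewApi mimics org.apache.hadoop.mapreduce. These are assumptions
// made for illustration, not the real Hadoop classes.
final class ApiMismatchDemo {
    static final class OldApi {
        interface InputFormat {}
        static abstract class FileInputFormat implements InputFormat {}
    }

    static final class NewApi {
        static abstract class InputFormat {}
        static abstract class FileInputFormat extends InputFormat {}
    }

    // Built on the new-style base class: unrelated to OldApi.InputFormat.
    static final class WrongWholeFileInputFormat extends NewApi.FileInputFormat {}

    // Built on the old-style base class that the old-style driver expects.
    static final class RightWholeFileInputFormat extends OldApi.FileInputFormat {}

    // Same shape as the parameter in the error message: only classes that
    // are subtypes of the old-style InputFormat are accepted.
    static String setInputFormat(Class<? extends OldApi.InputFormat> format) {
        return format.getSimpleName();
    }

    // Runtime counterpart of the check the compiler performs on the call above.
    static boolean acceptedByOldApi(Class<?> format) {
        return OldApi.InputFormat.class.isAssignableFrom(format);
    }

    public static void main(String[] args) {
        System.out.println(setInputFormat(RightWholeFileInputFormat.class));
        // The next line would fail to compile with a "not applicable for the
        // arguments" error, because the class extends the other hierarchy:
        // setInputFormat(WrongWholeFileInputFormat.class);
        System.out.println(acceptedByOldApi(WrongWholeFileInputFormat.class));
    }
}
```

Both classes share the simple name FileInputFormat in their base type, which is why the wrong import is easy to miss: only the package differs.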

Answer 2

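The easiest way is to gzip your input files. Gzip is not a splittable codec, so isSplitable() returns false for those files in TextInputFormat (which overrides FileInputFormat.isSplitable()), and each file goes to a single mapper in one piece. The mapper still receives the file one line at a time, but all lines of a file reach the same mapper.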
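
If the input files are produced from Java, the compression step might look like the following sketch. It uses only the standard library (Java 9 or later, for readAllBytes); the class name GzipInput and the temporary file names are assumptions for illustration. The resulting file would still have to be uploaded to the job's input directory, and it should keep a .gz suffix, since Hadoop selects the codec from the file extension.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipInput {
    // Writes a gzip-compressed copy of source to target.
    static void gzip(Path source, Path target) {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            Files.copy(source, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a gzip file back into memory; used here only to verify the result.
    static byte[] gunzip(Path source) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses data through temporary files and checks that it
    // decompresses to the same bytes.
    static boolean roundTrips(byte[] data) {
        try {
            Path plain = Files.createTempFile("input", ".txt");
            Path packed = Files.createTempFile("input", ".txt.gz");
            try {
                Files.write(plain, data);
                gzip(plain, packed);
                return Arrays.equals(data, gunzip(packed));
            } finally {
                Files.deleteIfExists(plain);
                Files.deleteIfExists(packed);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("line 1\nline 2\n".getBytes()));
    }
}
```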

Answer 3

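We ran into something similar and took an alternative, out-of-the-box approach.

Suppose you need to process 100 large files (f1, f2, ..., f100), and each file has to be read in full inside the map function. Instead of using the "WholeInputFileFormat" reader approach, we created 10 text files (p1, p2, ..., p10), each containing the HDFS URLs or web URLs of ten of the files f1-f100.

So p1 contains the URLs of f1-f10, p2 contains the URLs of f11-f20, and so on.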
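
The grouping of f1-f100 into p1-p10 can be sketched in plain Java as below. The class name ManifestPlanner, the method plan, and the hdfs:///data/ prefix are hypothetical names chosen for the example; writing each group out as a text file is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ManifestPlanner {
    // Splits urls into consecutive groups of at most groupSize entries.
    // Each group corresponds to the contents of one manifest file (p1, p2, ...).
    static List<List<String>> plan(List<String> urls, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<String>> manifests = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += groupSize) {
            int end = Math.min(start + groupSize, urls.size());
            manifests.add(new ArrayList<>(urls.subList(start, end)));
        }
        return manifests;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            urls.add("hdfs:///data/f" + i); // hypothetical location
        }
        List<List<String>> manifests = plan(urls, 10);
        for (int i = 0; i < manifests.size(); i++) {
            List<String> group = manifests.get(i);
            System.out.println("p" + (i + 1) + ": " + group.get(0)
                    + " .. " + group.get(group.size() - 1));
        }
    }
}
```

Changing the group size is what controls the number of manifest files, and with it the number of mappers.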

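These new files p1 to p10 are then used as the input to the mappers. Mapper m1, processing file p1, opens files f1 to f10 one at a time and processes each one in full.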
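
The control flow of one such mapper can be sketched in plain Java as follows. This is not a Hadoop Mapper: in a real job the map function would receive each manifest line as a record and open the referenced file through Hadoop's file system API. Here local paths stand in for the URLs, and ManifestProcessor, processWholeFile, and demo are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ManifestProcessor {
    // Hypothetical per-file logic; it sees the complete file contents at once.
    static int processWholeFile(byte[] contents) {
        return contents.length;
    }

    // Plays the role of one mapper: every non-blank manifest line names
    // one file, which is read in full before being processed.
    static List<Integer> processManifest(Path manifest) {
        List<Integer> results = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(manifest, StandardCharsets.UTF_8)) {
                String entry = line.trim();
                if (entry.isEmpty()) {
                    continue;
                }
                byte[] contents = Files.readAllBytes(Paths.get(entry));
                results.add(processWholeFile(contents));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    // Builds two small files and a manifest listing them in a temporary
    // directory (not cleaned up here), then processes the manifest.
    static List<Integer> demo() {
        try {
            Path dir = Files.createTempDirectory("manifest-demo");
            Path f1 = Files.write(dir.resolve("f1"), "abc".getBytes(StandardCharsets.UTF_8));
            Path f2 = Files.write(dir.resolve("f2"), "hello".getBytes(StandardCharsets.UTF_8));
            Path p1 = Files.write(dir.resolve("p1"),
                    Arrays.asList(f1.toString(), f2.toString()), StandardCharsets.UTF_8);
            return processManifest(p1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // one result per file listed in the manifest
    }
}
```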

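This approach lets us control the number of mappers and write more exhaustive and complex application logic in the MapReduce application. For example, we could use this approach to run NLP on PDF files.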
