Need help implementing this algorithm with Hadoop MapReduce

Question

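I have an algorithm that reads some text files from a large dataset and searches their lines for specific terms. I have implemented it in Java, but I don't want to post the code, so that it doesn't look like I'm asking someone to implement it for me. Still, I really do need a lot of help! This was not part of my project plan, but the dataset turned out to be huge, so my teacher told me I have to do it this way.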

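EDIT (I did not make this clear in my previous version): my dataset is on a Hadoop cluster, and I should write a MapReduce implementation of the algorithm.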

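I was reading about MapReduce and thought it would be easier to write the standard implementation first and then convert it to MapReduce. That didn't happen: the algorithm is quite simple and nothing special, and yet when it comes to map reduce... I can't wrap my mind around it.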

So here is the pseudocode of my algorithm:

LIST termList   (there is method that creates this list from lucene index)
FOLDER topFolder

INPUT topFolder
IF it is folder and not empty
    list files (there are 30 sub folders inside)
    FOR EACH sub folder
        GET file "CheckedFile.txt"
        analyze(CheckedFile)
    ENDFOR
END IF


Method ANALYZE(CheckedFile)

read CheckedFile
WHILE CheckedFile has next line
    GET line
    FOR(loops through termList)
        GET third word from line
        IF third word = term from list
            append whole line to string buffer
        ENDIF
    ENDFOR
END WHILE
OUTPUT string buffer to file

Also, as you can see, each time "analyze" is called a new file has to be created. As I understand it, writing to multiple outputs is difficult in map reduce?

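I understand the intuition behind MapReduce, and my example seems perfectly suited to it, but when it comes to actually doing it I obviously don't know enough, and it is very frustrating!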

Please help.

Answer 1

You can use an empty reducer and partition the job so that one mapper runs per file. Each mapper will then create its own output file in the output folder.
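As a rough illustration of what each mapper would do, here is a minimal plain-Java sketch of the per-line filter. It deliberately avoids the Hadoop API so that it runs on its own; in a real job this logic would sit inside the mapper's `map()` method, with the number of reduce tasks set to zero so that each mapper's output is written directly. The class name `TermLineFilter` and the method `matches` are hypothetical, and the whitespace-separated line layout is assumed from the pseudocode in the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TermLineFilter {

    // A Set gives constant-time lookups, which matters when the term list is long
    private final Set<String> terms;

    public TermLineFilter(Set<String> terms) {
        this.terms = terms;
    }

    // Returns true when the third whitespace-separated word of the line is a known term
    public boolean matches(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens.length >= 3 && terms.contains(tokens[2]);
    }

    public static void main(String[] args) {
        // Hypothetical terms and lines, only for demonstration
        TermLineFilter filter =
                new TermLineFilter(new HashSet<>(Arrays.asList("terma", "termb")));
        String[] lines = { "id1 x terma rest of line", "id2 y other rest", "short line" };
        for (String line : lines) {
            if (filter.matches(line)) {
                // A mapper would emit the line here instead of printing it
                System.out.println(line);
            }
        }
    }
}
```

Because the filter keeps no state between lines, it does not matter how the input is split across mappers; the one-mapper-per-file arrangement is only needed to get one output file per input file.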

Answer 2

Map Reduce is easy to implement using some nice Java 6 concurrency features, especially Future, Callable and ExecutorService.

I created a Callable that will analyse a file in the way you specified:

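import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;

public class FileAnalyser implements Callable<String> {

  private final Scanner scanner;
  private final List<String> termList;

  public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
    this.termList = termList;
    //Opening the file here means a missing file fails before the task is submitted
    this.scanner = new Scanner(new File(filename));
  }

  @Override
  public String call() throws Exception {
    StringBuilder buffer = new StringBuilder();
    try {
      while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        String[] tokens = line.split(" ");
        //Keep the whole line when its third word is one of the terms
        if ((tokens.length >= 3) && (inTermList(tokens[2])))
          buffer.append(line).append('\n');
      }
    } finally {
      //Release the file handle even if reading fails
      scanner.close();
    }
    return buffer.toString();
  }

  private boolean inTermList(String term) {
    return termList.contains(term);
  }
}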

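We need to create a new callable for each file found and submit it to the executor service. The result of the submission is a Future, which we can use later to obtain the result of parsing the file.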

import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Analyser {

  private static final int THREAD_COUNT = 10;

  public static void main(String[] args) throws InterruptedException, ExecutionException {

    //All callables will be submitted to this executor service
    //Play around with THREAD_COUNT for optimum performance
    ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

    //Store all futures in this list so we can refer to them easily
    List<Future<String>> futureList = new ArrayList<Future<String>>();

    //Some random term list, I don't know what you're using.
    List<String> termList = new ArrayList<String>();
    termList.add("terma");
    termList.add("termb");

    //For each file you find, create a new FileAnalyser callable and submit
    //this to the executor service. Add the future to the list
    //so we can check back on the result later.
    //Assumption: the file names are passed as command-line arguments;
    //replace this with however you discover your files.
    for (String filename : args) {
      try {
        Callable<String> worker = new FileAnalyser(filename, termList);
        Future<String> future = executor.submit(worker);
        futureList.add(future);
      }
      catch (FileNotFoundException fnfe) {
        //If the file doesn't exist at this point we can probably ignore,
        //but I'll leave that for you to decide.
        System.err.println("Unable to create future for " + filename);
        fnfe.printStackTrace(System.err);
      }
    }

    //No more tasks will be submitted; tasks already submitted still run
    executor.shutdown();

    //You may want to wait at this point, until all threads have finished
    //You could maybe loop through each future until isDone() holds true
    //for each of them.

    //Loop over all futures and do something with the result from each.
    //get() blocks until that particular future has finished.
    for (Future<String> current : futureList) {
      String result = current.get();
      //Do something with the result from this future
    }
  }
}
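For the waiting step mentioned in the comments above, one option is `ExecutorService.invokeAll`, which blocks until every submitted task has completed and returns the futures in the same order as the task list. The following is a minimal sketch under that assumption; the class name `InvokeAllSketch`, the helper `runAll` and the dummy tasks are hypothetical stand-ins for the file callables, and the lambdas require Java 8 rather than the Java 6 mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllSketch {

    // Runs every task on a fixed pool and returns the results in task-list order
    public static List<String> runAll(List<Callable<String>> tasks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed
            for (Future<String> future : executor.invokeAll(tasks)) {
                // get() rethrows a task failure wrapped in ExecutionException
                results.add(future.get());
            }
            return results;
        } finally {
            // Let the pool's worker threads exit once the work is done
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy tasks standing in for one callable per file
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int n = i;
            tasks.add(() -> "result " + n);
        }
        System.out.println(runAll(tasks, 2));
    }
}
```

This keeps all results in memory at once, so it fits best when the matching lines per file are small.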

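My example is far from complete and far from efficient. I haven't considered the sample size; if it is really big, you could keep looping over the futureList and removing the elements that have finished, something like: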

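//Needs java.util.Iterator; the enclosing method must declare
//InterruptedException and ExecutionException because of get()
while (!futureList.isEmpty()) {
  Iterator<Future<String>> it = futureList.iterator();
  while (it.hasNext()) {
    Future<String> current = it.next();
    if (current.isDone()) {
      String result = current.get();
      //Do something with result
      it.remove(); //Removing through the iterator is safe during iteration
    }
  }
}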

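Alternatively, you could implement a producer-consumer setup, where the producer submits callables to the executor service and produces a future, and the consumer takes the result of the future and then discards the future.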

This might require the producer and the consumer to be threads themselves, plus a synchronized list for adding and removing futures.
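The JDK's `ExecutorCompletionService` already provides something close to this arrangement: it queues each future as soon as its task finishes, so a consumer can take results in completion order without polling or a hand-written synchronized list. Below is a minimal sketch, assuming that completion order is acceptable for the output; the class name `CompletionOrderSketch`, the helper `collect` and the dummy tasks are hypothetical, and the lambdas require Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionOrderSketch {

    // Submits all tasks, then consumes each result as soon as its task finishes
    public static List<String> collect(List<Callable<String>> tasks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        CompletionService<String> completion = new ExecutorCompletionService<>(executor);
        try {
            // Producer side: hand every task to the pool
            for (Callable<String> task : tasks) {
                completion.submit(task);
            }
            // Consumer side: one take() per submitted task
            List<String> results = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                // take() blocks until the next finished future is available
                results.add(completion.take().get());
            }
            return results;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy tasks standing in for one callable per file
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(() -> "first file result");
        tasks.add(() -> "second file result");
        // The order of the printed results depends on which task finishes first
        System.out.println(collect(tasks, 2));
    }
}
```

In a real run, each result could be written to its output file inside the consumer loop instead of being collected, so that finished results do not accumulate in memory.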

If you have any questions, please ask.


Answer 1
3 2010-06-06 23:03:58

Answer 2
2 2010-06-07 09:49:57
