How do I pass multiple input format files to a map-reduce job?

Question

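I am writing a map-reduce program to query a Cassandra column family. I only need to read a subset of the rows (selected by row key) from one column family, and I already have the set of row keys I want to read. How can I pass this "row key set" to the map-reduce job so that it outputs only that subset of rows from the Cassandra column family?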

Outline:

  class GetRows
  {
    public Set<String> getRowKeys()
    {
      // logic that builds the set of row keys ...
      return rowKeys;
    }
  }


  class MapReduceCassandra
  {
    // input format: ColumnFamilyInputFormat
    // ...
    // also needs the row-key set as input. How do I get it here?
  }

Can anyone suggest the best way to invoke map-reduce from a Java application, and how to pass a set of keys to the map-reduce job?

Answer 1

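Invoking map-reduce from Java

To do this, you can use the classes in the org.apache.hadoop.mapreduce package from your Java application (the older mapred API can be used in a very similar way; just check the API docs):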

Job job = Job.getInstance(new Configuration());
// configure job: set input and output types and directories, etc.

job.setJarByClass(MapReduceCassandra.class);
job.submit();

Passing data to the map-reduce job

If your set of row keys is really small, you can serialize it to a string and pass it as a configuration parameter:

job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer

//...

job.submit();

On the job side, you can access the parameter through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    String rowsSerialized = context.getConfiguration().get("CassandraRows");
    String[] rows = deserializeRows(rowsSerialized);  // TODO: implement deserializer

    //...
}
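
The two helpers marked TODO in the snippets above are left open by the answer. A minimal sketch of one way to fill them in, using only the JDK, is shown here. The class name RowKeyCodec and its method names are hypothetical, and the choice of Base64-encoding each key before joining with commas is an assumption made so that keys containing commas survive the round trip. It also assumes keys are non-empty, and it returns a Set<String> rather than the String[] used above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper: packs a small set of row keys into one string
// that fits in a single Configuration property, and unpacks it again.
final class RowKeyCodec {
    private RowKeyCodec() {}

    // Base64-encode each key, then join with commas. The Base64 alphabet
    // contains no comma, so the separator cannot collide with key content.
    static String serialize(Set<String> rowKeys) {
        StringJoiner joined = new StringJoiner(",");
        for (String key : rowKeys) {
            byte[] raw = key.getBytes(StandardCharsets.UTF_8);
            joined.add(Base64.getEncoder().encodeToString(raw));
        }
        return joined.toString();
    }

    // Reverse of serialize. A null or empty string yields an empty set.
    static Set<String> deserialize(String serialized) {
        Set<String> rowKeys = new LinkedHashSet<>();
        if (serialized == null || serialized.isEmpty()) {
            return rowKeys;
        }
        for (String encoded : serialized.split(",")) {
            byte[] raw = Base64.getDecoder().decode(encoded);
            rowKeys.add(new String(raw, StandardCharsets.UTF_8));
        }
        return rowKeys;
    }
}
```

With this sketch, the driver would call RowKeyCodec.serialize(rowKeys) when setting "CassandraRows", and the mapper would call RowKeyCodec.deserialize(...) on the value it reads back. Decoding once in the mapper's setup() method rather than on every map() call avoids repeating the work per record.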

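However, if your key set is potentially unbounded, passing it as a parameter is a bad idea. Instead, pass the keys in a file and take advantage of the distributed cache. To do that, add this line to the driver code above before submitting the job: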

job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri());

//...

job.submit();

Inside the job, you can access this file through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    URI[] cacheFiles = context.getCacheFiles();

    // find, open and read your file here

    // ...
}
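
The "find, open and read your file" step is not spelled out in the answer. As a rough sketch, assuming the key file is plain text with one row key per line, the parsing part could look like this. The class name RowKeyFile and the one-key-per-line format are assumptions for illustration. The method takes a java.io.Reader so that it does not depend on how the cached file is opened (for example as a local file, or through Hadoop's FileSystem API).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: loads row keys from a text source, one key per line.
final class RowKeyFile {
    private RowKeyFile() {}

    // Reads every line, trims surrounding whitespace, skips blank lines,
    // and collects the remaining keys into a set (duplicates collapse).
    // The reader is closed when loading finishes.
    static Set<String> load(Reader source) throws IOException {
        Set<String> keys = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    keys.add(key);
                }
            }
        }
        return keys;
    }
}
```

The resulting set would typically be loaded once per task, in the mapper's setup() method, and kept in a field. The map() method can then emit a row only when its key is in the set, which gives the "subset of rows" behaviour the question asks for.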

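Note: all of this applies to the new API (org.apache.hadoop.mapreduce). If you are using org.apache.hadoop.mapred, the approach is very similar, but some of the relevant methods are called on different objects.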
