
How to pass multiple input format files to a map-reduce job?

I am writing a map-reduce program to query a Cassandra column family. I need to read only a subset of rows (by row key) from a single column family, and I already have the set of row keys for the rows I have to read. How can I pass this "row key set" to the map-reduce job so that it outputs only that subset of rows from the Cassandra column family?

Outline:


  class GetRows
  {
    public Set<String> getRowKeys()
    {
      Set<String> rowKeys = new HashSet<String>();
      // logic to populate rowKeys ...
      return rowKeys;
    }
  }


  class MapReduceCassandra
  {
    // input format: ColumnFamilyInputFormat
    // ...
    // also needs the input key set -- how do I get it there?
  }

Can anyone suggest the best way to call map-reduce from a Java application, and how to pass the set of keys to the job?

Calling map-reduce from Java

To do this, you can use classes from the org.apache.hadoop.mapreduce package (the older mapred API works with a very similar approach; just check the API docs) from within your Java application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration());
// configure the job: set input and output formats, paths, mapper/reducer classes, etc.

job.setJarByClass(MapReduceCassandra.class);
job.submit();
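
For the Cassandra part of the setup, the configuration might look roughly like the sketch below. This is my own illustration, not part of the original answer: it assumes the old Thrift-based ColumnFamilyInputFormat and ConfigHelper from Cassandra's org.apache.cassandra.hadoop package, and the host, port, partitioner, keyspace and column family values are all placeholders:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration());
job.setJarByClass(MapReduceCassandra.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);

Configuration conf = job.getConfiguration();
ConfigHelper.setInputInitialAddress(conf, "localhost");       // placeholder host
ConfigHelper.setInputRpcPort(conf, "9160");                   // default Thrift port
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner"); // must match the cluster
ConfigHelper.setInputColumnFamily(conf, "myKeyspace", "myColumnFamily");

// ColumnFamilyInputFormat also requires a SlicePredicate selecting the columns
// to read, set via ConfigHelper.setInputSlicePredicate(conf, predicate);

// set mapper, reducer and output types here, then:
job.submit();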

Passing data to the map-reduce job

If your set of row keys is really small, you can serialize it to a string and pass it as a configuration parameter:

job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer

//...

job.submit();

Inside the job you'll be able to access the parameter through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    String rowsSerialized = context.getConfiguration().get("CassandraRows");
    String[] rows = deserializeRows(rowsSerialized);  // TODO: implement deserializer

    //...
}
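
As a minimal sketch of the serializer/deserializer pair left as TODOs above (my own illustration, assuming no row key ever contains a comma), you could simply join and split on a delimiter:

// join the keys with a comma delimiter (assumes no key contains a comma)
public static String getRowsKeysSerialized(Set<String> rowKeys)
{
    return String.join(",", rowKeys);
}

// split the configuration value back into the individual keys
public static String[] deserializeRows(String rowsSerialized)
{
    return rowsSerialized.split(",");
}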

However, if your set is potentially unbounded, passing it as a parameter is a bad idea. Instead you should pass the keys in a file and take advantage of the distributed cache. Then you can just add this line to the portion above, before you submit the job:

job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri());

//...

job.submit();

Inside the job you'll be able to access this file through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    URI[] cacheFiles = context.getCacheFiles();

    // find, open and read your file here

    // ...
}
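
For example, here is a minimal sketch of reading the cached key file in the mapper's setup() method. This is my own illustration, assuming the file contains one row key per line:

// needs imports: java.io.BufferedReader, java.io.InputStreamReader, java.net.URI,
// java.util.HashSet, java.util.Set, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path

private Set<String> rowKeys = new HashSet<String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0)
    {
        FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0])), "UTF-8"));
        try
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                rowKeys.add(line.trim());  // one row key per line (assumed file format)
            }
        }
        finally
        {
            reader.close();
        }
    }
}

The map() method can then consult rowKeys and simply skip any input row whose key is not in the set.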

NOTE: All of this is for the new API (org.apache.hadoop.mapreduce). If you're using org.apache.hadoop.mapred, the approach is very similar, but some relevant methods are invoked on different objects.
