
How to pass multiple input format files to a map-reduce job?

I am writing a map-reduce program to query a Cassandra column family. I need to read only a subset of rows (by row key) from a single column family, and I already have the set of row keys for the rows I have to read. How can I pass this "row key set" to the map-reduce job so that it outputs only that subset of rows from the Cassandra column family?

Outline:


  class GetRows
  {
    public Set<String> getRowKeys()
    {
      Set<String> rowKeys = new HashSet<String>();
      // logic to populate rowKeys ...
      return rowKeys;
    }
  }


  class MapReduceCassandra
  {
    // input format: ColumnFamilyInputFormat
    // ...
    // also needs the input key set -- how do I get it there?
  }

Can anyone suggest the best way to call map-reduce from a Java application, and how to pass the set of keys to the job?

Calling map-reduce from Java

To do this, you can use classes from the org.apache.hadoop.mapreduce package (the older mapred API works with a very similar approach; just check the API docs) from within your Java application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration());
// configure the job: set input and output formats, paths, mapper/reducer classes, etc.

job.setJarByClass(MapReduceCassandra.class);
job.submit();
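
For the Cassandra part of the setup, the configuration might look roughly like the sketch below. This is my own illustration, not part of the original answer: it assumes the old Thrift-based ColumnFamilyInputFormat and ConfigHelper from Cassandra's org.apache.cassandra.hadoop package, and the host, port, partitioner, keyspace and column family values are all placeholders:

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration());
job.setJarByClass(MapReduceCassandra.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);

Configuration conf = job.getConfiguration();
ConfigHelper.setInputInitialAddress(conf, "localhost");       // placeholder host
ConfigHelper.setInputRpcPort(conf, "9160");                   // default Thrift port
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner"); // must match the cluster
ConfigHelper.setInputColumnFamily(conf, "myKeyspace", "myColumnFamily");

// ColumnFamilyInputFormat also requires a SlicePredicate selecting the columns
// to read, set via ConfigHelper.setInputSlicePredicate(conf, predicate);

// set mapper, reducer and output types here, then:
job.submit();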

Passing data to the map-reduce job

If your set of row keys is really small, you can serialize it to a string and pass it as a configuration parameter:

job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer

//...

job.submit();

Inside the job you'll be able to access the parameter through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    String rowsSerialized = context.getConfiguration().get("CassandraRows");
    String[] rows = deserializeRows(rowsSerialized);  // TODO: implement deserializer

    //...
}
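
As a minimal sketch of the serializer/deserializer pair left as TODOs above (my own illustration, assuming no row key ever contains a comma), you could simply join and split on a delimiter:

// join the keys with a comma delimiter (assumes no key contains a comma)
public static String getRowsKeysSerialized(Set<String> rowKeys)
{
    return String.join(",", rowKeys);
}

// split the configuration value back into the individual keys
public static String[] deserializeRows(String rowsSerialized)
{
    return rowsSerialized.split(",");
}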

However, if your set is potentially unbounded, passing it as a parameter is a bad idea. Instead you should pass the keys in a file and take advantage of the distributed cache. Then you can just add this line to the portion above, before you submit the job:

job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri());

//...

job.submit();

Inside the job you'll be able to access this file through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    URI[] cacheFiles = context.getCacheFiles();

    // find, open and read your file here

    // ...
}
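
For example, here is a minimal sketch of reading the cached key file in the mapper's setup() method. This is my own illustration, assuming the file contains one row key per line:

// needs imports: java.io.BufferedReader, java.io.InputStreamReader, java.net.URI,
// java.util.HashSet, java.util.Set, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path

private Set<String> rowKeys = new HashSet<String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0)
    {
        FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0])), "UTF-8"));
        try
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                rowKeys.add(line.trim());  // one row key per line (assumed file format)
            }
        }
        finally
        {
            reader.close();
        }
    }
}

The map() method can then consult rowKeys and simply skip any input row whose key is not in the set.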

NOTE: All of this is for the new API (org.apache.hadoop.mapreduce). If you're using org.apache.hadoop.mapred, the approach is very similar, but some relevant methods are invoked on different objects.
