I am writing a map-reduce program to query a Cassandra column family. I need to read only a subset of rows (by row key) from a single column family, and I already have the set of row keys I need to read. How can I pass this row-key set to the map-reduce job so that it outputs only those rows from the Cassandra column family?
Abstract:
class GetRows {
    public Set<String> getRowKeys() {
        // logic ...
        return rowKeys;
    }
}

class MapReduceCassandra {
    // input format: ColumnFamilyInputFormat
    // ...
    // also needs the input key set -- how do I get it in here?
}
Can anyone suggest the best way to invoke map-reduce from a Java application, and how to pass the set of keys to the map-reduce job?
Calling map-reduce from Java
To do this, you can use classes from the org.apache.hadoop.mapreduce package (the older org.apache.hadoop.mapred API works with a very similar approach; check the API docs) from within your Java application:
Job job = Job.getInstance(new Configuration());
// configure job: set input and output types and directories, etc.
job.setJarByClass(MapReduceCassandra.class);
job.submit();
Passing data to the mapreduce job
If your set of row keys is really small, you can serialize it to a string, and pass it as a configuration parameter:
job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer
//...
job.submit();
Inside the job you'll be able to access the parameter through the context object:
public void map(
IntWritable key, // your key type
Text value, // your value type
Context context
)
{
// ...
String rowsSerialized = context.getConfiguration().get("CassandraRows");
String[] rows = deserializeRows(rowsSerialized); // TODO: implement deserializer
//...
}
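A minimal way to fill in the two TODOs above (a hypothetical helper, assuming the row keys never contain the delimiter character, "," here) is plain string joining and splitting:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical codec for the "CassandraRows" configuration value.
// Assumes row keys never contain the ',' delimiter.
public class RowKeyCodec {

    // Joins the key set into a single comma-separated string.
    public static String serialize(Set<String> keys) {
        return String.join(",", keys);
    }

    // Splits the string back into individual keys.
    public static String[] deserialize(String serialized) {
        return serialized.isEmpty() ? new String[0] : serialized.split(",");
    }
}
```

In the mapper you would typically wrap the deserialized array in a HashSet once (e.g. in setup()) so that each incoming row key can be checked for membership in constant time.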
However, if your set is potentially unbounded, passing it as a configuration parameter is a bad idea. Instead, put the keys in a file and take advantage of the distributed cache. Then add this line to the code above before submitting the job:
job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri());
//...
job.submit();
Inside the job you'll be able to access this file through the context object:
public void map(
IntWritable key, // your key type
Text value, // your value type
Context context
)
{
// ...
URI[] cacheFiles = context.getCacheFiles();
// find, open and read your file here
// ...
}
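Reading the key file itself is ordinary file I/O. A sketch of a hypothetical helper (assuming one row key per line) that you could call from setup() or map() with the local path of the cache file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: loads one row key per line into a set,
// skipping blank lines. In the mapper you would pass it the
// locally cached copy of the file you registered with addCacheFile().
public class KeyFileReader {

    public static Set<String> loadKeys(Path file) throws IOException {
        Set<String> keys = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.isBlank()) {
                keys.add(line.trim());
            }
        }
        return keys;
    }
}
```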
NOTE: All of this is for the new API (org.apache.hadoop.mapreduce). If you're using org.apache.hadoop.mapred the approach is very similar, but some of the relevant methods are invoked on different objects.