I am trying to run SQL-like queries on HBase using Spark. I am able to run the queries, but only on a very small set of data; the moment the dataset size increases, Spark takes a very long time to complete the job.
HBase table size: ~3 million rows. Please find the code below:
JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
        .newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class,
                org.apache.hadoop.hbase.client.Result.class)
        .filter(new Function<Tuple2<ImmutableBytesWritable, Result>, Boolean>() {
            public Boolean call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                // Keep only rows whose "si:at" timestamp falls inside the window.
                long time = Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at")));
                return time > 1407314522 && time < 1407814522;
            }
        });
JavaRDD<Person> people = pairRdd
        .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
            public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                Person person = new Person();
                person.setCalling(Bytes.toLong(v1._2.getRow()));
                person.setCalled(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("called"))));
                person.setTime(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at"))));
                return person;
            }
        });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");
// SQL can be run over RDDs that have been registered as tables.
JavaSchemaRDD teenagers = sqlCtx
        .sql("SELECT calling, called FROM people WHERE time > 1407314522");
teenagers.printSchema();
List<Map<Long, Long>> teenagerNames = teenagers
        .map(new Function<Row, Map<Long, Long>>() {
            public Map<Long, Long> call(Row row) {
                Map<Long, Long> tmpMap = new HashMap<Long, Long>();
                tmpMap.put(row.getLong(0), row.getLong(1));
                return tmpMap;
            }
        }).collect();
for (Map<Long, Long> teenagerNamestmp : teenagerNames) {
    for (Map.Entry<Long, Long> entry : teenagerNamestmp.entrySet()) {
        System.out.println(entry.getKey() + "/" + entry.getValue());
    }
}
I don't know if I am missing some configuration setting. Any pointers would be of great help.
Thanks,
The TableInputFormat is MapReduce-based and will therefore be much slower than a native client: it performs a full table scan and ships every row to Spark before your filter runs. Look for a native HBase API driven solution in a future release of Spark.
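In the meantime, a common mitigation is to push the time-range predicate down to HBase so that rows outside the window never leave the region servers, instead of filtering in Spark. A minimal sketch, assuming the same `si:at` column stores a big-endian long written with `Bytes.toBytes(long)` (byte-order comparison is only range-correct for non-negative values, which these epoch timestamps are), and that the table name `people` is a placeholder for yours:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("si")); // fetch only the family actually read
scan.setCaching(500);                // more rows per RPC, fewer round trips
scan.setCacheBlocks(false);          // full scans shouldn't evict the block cache

// Server-side range filter on si:at, replacing the Spark-side filter(...).
SingleColumnValueFilter lower = new SingleColumnValueFilter(
        Bytes.toBytes("si"), Bytes.toBytes("at"),
        CompareOp.GREATER, Bytes.toBytes(1407314522L));
lower.setFilterIfMissing(true);      // drop rows that lack the column entirely
SingleColumnValueFilter upper = new SingleColumnValueFilter(
        Bytes.toBytes("si"), Bytes.toBytes("at"),
        CompareOp.LESS, Bytes.toBytes(1407814522L));
upper.setFilterIfMissing(true);

FilterList range = new FilterList(FilterList.Operator.MUST_PASS_ALL);
range.addFilter(lower);
range.addFilter(upper);
scan.setFilter(range);

Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "people"); // hypothetical table name
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
// Pass this conf to ctx.newAPIHadoopRDD(...) as before; the Spark-side
// filter(...) step then becomes unnecessary.
```

This doesn't change the MapReduce-based input format, but it moves the predicate evaluation onto the region servers, which usually cuts both the scan I/O and the data shipped into Spark substantially. If the timestamp were part of the row key, setting `scan.setStartRow`/`setStopRow` would be faster still, since it would avoid scanning the excluded rows at all.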