
SQL query on HBase using Spark

I am trying to run SQL-like queries on HBase using Spark. I am able to run the queries, but only on a very small data set; as soon as the data set grows, Spark takes a very long time to complete the job.

The HBase table has about 3 million rows. Please find the code below:

    JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
            .newAPIHadoopRDD(conf, TableInputFormat.class,
                    ImmutableBytesWritable.class,
                    org.apache.hadoop.hbase.client.Result.class)
            .filter(new Function<Tuple2<ImmutableBytesWritable, Result>, Boolean>() {

                public Boolean call(
                        Tuple2<ImmutableBytesWritable, Result> v1)
                        throws Exception {
                    // Keep only rows whose si:at value falls inside the time window.
                    long time = Bytes.toLong(v1._2.getValue(
                            Bytes.toBytes("si"), Bytes.toBytes("at")));
                    return time > 1407314522L && time < 1407814522L;
                }
            });

    JavaRDD<Person> people = pairRdd
            .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {

                public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                        throws Exception {
                    // Debug output; note that printing once per row is itself
                    // costly when the table has millions of rows.
                    System.out.println("coming");
                    Person person = new Person();
                    person.setCalling(Bytes.toLong(v1._2.getRow()));
                    person.setCalled(Bytes.toLong(v1._2.getValue(
                            Bytes.toBytes("si"), Bytes.toBytes("called"))));
                    person.setTime(Bytes.toLong(v1._2.getValue(
                            Bytes.toBytes("si"), Bytes.toBytes("at"))));
                    return person;
                }
            });
    JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
    schemaPeople.registerAsTable("people");

    // SQL can be run over RDDs that have been registered as tables.
    JavaSchemaRDD teenagers = sqlCtx
            .sql("SELECT calling,called FROM people WHERE time >1407314522");
    teenagers.printSchema();
    List<Map<Long, Long>> teenagerNames = teenagers.map(
            new Function<Row, Map<Long, Long>>() {
                public Map<Long, Long> call(Row row) {
                    Map<Long, Long> tmpMap = new HashMap<Long, Long>();
                    tmpMap.put(row.getLong(0), row.getLong(1));
                    return tmpMap;
                }
            }).collect();
    for (Map<Long, Long> teenagerNamestmp : teenagerNames) {
        for (Map.Entry<Long, Long> entry : teenagerNamestmp.entrySet()) {
            System.out.println(entry.getKey() + "/" + entry.getValue());
        }
    }

I don't know if I am missing some configuration setting. Any pointers would be of great help.
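On the configuration point: one setting that commonly dominates `TableInputFormat` scan time is row caching, since the default fetches one row per RPC. A hedged sketch, assuming the same `conf` object that is later passed to `newAPIHadoopRDD`; the property names are the constants defined on `TableInputFormat`, and the values here are illustrative, not tuned:

    // Sketch only: tune the scan that TableInputFormat builds from the job conf.
    conf.set(TableInputFormat.SCAN_COLUMNS, "si:at si:called"); // scan only the columns actually read
    conf.set(TableInputFormat.SCAN_CACHEDROWS, "500");          // rows fetched per RPC round trip

Restricting the scan to the two columns the job reads, and batching rows per RPC, reduces both the bytes shipped off the region servers and the number of round trips.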

Thanks,

The `TableInputFormat` is MapReduce-based and thus will be much slower. Look for a native HBase-API-driven solution in a future release of Spark.
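In the meantime, much of the cost can be cut by pushing the time-window predicate down to the region servers, instead of shipping all 3 million rows to Spark and filtering there. A minimal sketch, assuming an HBase 0.96+ client (for `ProtobufUtil`) and the column layout from the question; it relies on `Bytes.toBytes(long)` being big-endian, so binary comparison matches numeric comparison for non-negative values:

    // Sketch: configure a server-side Scan and hand it to TableInputFormat.
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("at"));
    scan.addColumn(Bytes.toBytes("si"), Bytes.toBytes("called"));
    scan.setCaching(500);        // batch rows per RPC
    scan.setCacheBlocks(false);  // a full scan should not churn the block cache

    // Server-side equivalent of the Spark-side filter:
    // 1407314522 < si:at < 1407814522
    FilterList window = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    window.addFilter(new SingleColumnValueFilter(
            Bytes.toBytes("si"), Bytes.toBytes("at"),
            CompareFilter.CompareOp.GREATER, Bytes.toBytes(1407314522L)));
    window.addFilter(new SingleColumnValueFilter(
            Bytes.toBytes("si"), Bytes.toBytes("at"),
            CompareFilter.CompareOp.LESS, Bytes.toBytes(1407814522L)));
    scan.setFilter(window);

    // Serialize the Scan into the conf that newAPIHadoopRDD reads.
    conf.set(TableInputFormat.SCAN,
            Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray()));

With the predicate evaluated server-side, the `filter(...)` step in the question becomes unnecessary; only matching rows ever reach Spark.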
