
Distributed analysis of HBase data

I'm fairly new to HBase. I've been able to set up HBase and query data that's stored across multiple Hadoop machines, but I'm wondering whether it's possible to distribute the analysis of the data in HBase as well.

Here's my situation: I have a few billion records that I need to analyse quickly. I would like X servers to query the database and each get a unique part of the result to work on, instead of having a single server go through the entire dataset. Is this possible, and how can I do it?

I'm very unsure how to approach this, because I realize all the queries will need to be coordinated (the servers cannot each query HBase independently, otherwise HBase will not know how to split the request among them). I'm confused, but I thought there might be a native way to do this in Hadoop?

If it helps, my application is written in Java and I'm running the cluster on EC2 using the Cloudera distribution.

HBase builds on Hadoop for a reason :) You can use Hadoop's MapReduce framework to distribute the analysis and let Hadoop/HBase take care of distributing the load. You can start with the docs to see what can be done.
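To make the "let Hadoop/HBase take care of distributing the load" part a bit more concrete: in an HBase MapReduce job (usually wired up with `TableMapReduceUtil.initTableMapperJob` and a `TableMapper` subclass), the table input format hands out roughly one input split per region, so each map task scans its own disjoint row-key range and no extra coordination is needed. The sketch below is only a plain-Java model of that idea and does not use the HBase API. `RegionSplitSketch`, `KeyRange`, the sample keys and the in-memory `TreeMap` "table" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Plain-Java model (no HBase dependency) of one-input-split-per-region. */
final class RegionSplitSketch {

    /** Half-open row-key range [startKey, endKey); an empty endKey means "to the end of the table". */
    static final class KeyRange {
        final String startKey;
        final String endKey;

        KeyRange(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + ")";
        }
    }

    /** Turns sorted region start keys into one disjoint range per region. */
    static List<KeyRange> splitsFor(List<String> sortedRegionStartKeys) {
        List<KeyRange> splits = new ArrayList<>();
        for (int i = 0; i < sortedRegionStartKeys.size(); i++) {
            String start = sortedRegionStartKeys.get(i);
            boolean last = (i + 1 == sortedRegionStartKeys.size());
            splits.add(new KeyRange(start, last ? "" : sortedRegionStartKeys.get(i + 1)));
        }
        return splits;
    }

    /** Stand-in for the per-split "map" work: counts the rows inside one range. */
    static long countRows(NavigableMap<String, String> table, KeyRange range) {
        if (range.endKey.isEmpty()) {
            return table.tailMap(range.startKey, true).size();
        }
        return table.subMap(range.startKey, true, range.endKey, false).size();
    }

    /** Hypothetical sample data: a sorted map playing the role of the HBase table. */
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String key : Arrays.asList("apple", "banana", "grape", "kiwi", "mango", "orange", "pear")) {
            table.put(key, "value-of-" + key);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        // Assumed region boundaries; a real table reports these itself.
        for (KeyRange split : splitsFor(Arrays.asList("", "g", "n"))) {
            System.out.println(split + " -> " + countRows(table, split) + " rows");
        }
    }
}
```

Because the ranges are disjoint and together cover the whole key space, the per-range counts add up to the table's total row count, which is what lets each worker process "its" part without talking to the others.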

Another option is to write coprocessors. Coprocessors run on the region servers, so they work close to the data. You can find a nice intro here.
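The appeal of working "close to the data" is that each region can compute a small partial result locally and only that result travels back to the client, which then merges the partials. The following is a rough plain-Java model of that pattern, not real coprocessor code: `EndpointAggregationSketch`, `Region` and `localSum` are hypothetical names, and the thread pool merely stands in for region servers answering in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Plain-Java model (no HBase dependency) of endpoint-style aggregation. */
final class EndpointAggregationSketch {

    /** Hypothetical stand-in for one region: it owns its values and aggregates them locally. */
    static final class Region {
        private final long[] values;

        Region(long... values) {
            this.values = values;
        }

        /** Runs "next to the data": only this single number leaves the region. */
        long localSum() {
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            return sum;
        }
    }

    /** Client side: ask every region for its partial sum in parallel, then merge the partials. */
    static long distributedSum(List<Region> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (Region region : regions) {
                Callable<Long> task = region::localSum;
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (Exception e) {
            // Wrap checked exceptions so callers of this sketch stay simple.
            throw new IllegalStateException("partial aggregation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `distributedSum` over three regions holding `{1, 2, 3}`, `{10, 20}` and no values returns 36. Sums, counts, min and max suit this pattern because partial results combine cleanly; an analysis that needs every raw record on one machine gains much less from it.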

