
HBase scan vs MapReduce for on-the-fly computation

I need to compute an aggregate over an HBase table.

Say I have this HBase table: 'metadata', with column family M and column n.

Each metadata object holds a list of strings:

class Metadata {
    List<String> tags; // the tags stored in M:n
}

I need to compute the count of tags, and I was thinking of using either MapReduce or a scan over HBase directly.

The result has to be returned on the fly. So which one can I use in this scenario: a scan over HBase that computes the aggregate, or MapReduce?

MapReduce is ultimately going to scan HBase and compute the count anyway.

What are the pros and cons of each?

I suspect you're not aware of the pros and cons of HBase: it's not suited for computing real-time aggregations over large datasets.

Let's start by saying that a MapReduce job is a scheduled batch job, so you won't be able to return the response on the fly; expect no less than 15 seconds for the TaskTracker to initialize the job.

In the end, the MapReduce job will do exactly the same thing: an HBase scan. The difference between performing the scan right away and running it through MapReduce is just the parallelization and data locality, which pay off when you have millions or billions of rows. If your queries only need to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do that at query time.
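To make the "just do a scan" case concrete, here is a minimal sketch of a query-time aggregation over a range of consecutive rows. It is not HBase client code: a `TreeMap` stands in for the 'metadata' table (it keeps keys sorted, loosely like HBase orders rows by row key), and the class name, method name, and row keys are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the 'metadata' table: row key -> tags held in M:n.
// Assumption: a sorted map is only a rough model of HBase's row-key ordering.
class ScanTimeTagCount {

    // Walks every row with startRow <= key < stopRow and sums the tag counts.
    // The cost grows with the number of rows in the range, which is why this
    // only works at query time for small ranges.
    static long countTags(TreeMap<String, List<String>> table,
                          String startRow, String stopRow) {
        long total = 0;
        for (List<String> tags : table.subMap(startRow, true, stopRow, false).values()) {
            total += tags.size();
        }
        return total;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> table = new TreeMap<>();
        table.put("doc-001", Arrays.asList("hbase", "scan"));
        table.put("doc-002", Arrays.asList("mapreduce"));
        table.put("doc-003", Arrays.asList("hbase", "counter", "rowkey"));

        // Stop row is exclusive, so only doc-001 and doc-002 are read.
        System.out.println(countTags(table, "doc-001", "doc-003")); // prints 3
    }
}
```

The point of the sketch is the shape of the work: every request touches every row in the range, so response time scales with the range size rather than with the size of the answer.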

HBase is best suited for handling tons of atomic reads and writes. That way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you need or how many requests you receive: with a proper row key design and split policy you can scale to satisfy the demand.

Think of it as a word count: you could store all the words in a list and count them at query time when requested, or you could process that list at insert time and store the number of times each word is used in the document, both as a global counter and in daily, monthly, yearly, per-country, and per-author tables (or even column families).
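The insert-time alternative can be sketched the same way. Again this is plain Java rather than HBase client code: a `ConcurrentHashMap` of `LongAdder`s stands in for counter cells, and the key layout (`global:<tag>`, `daily:<day>:<tag>`) is a hypothetical scheme chosen for illustration, not something taken from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Pre-aggregated counters maintained on the write path.
// Assumption: the map is an in-memory stand-in for counter cells in a table.
class InsertTimeTagCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Write path: for each tag, bump one global counter and one per-day counter.
    void recordTags(String day, List<String> tags) {
        for (String tag : tags) {
            increment("global:" + tag);
            increment("daily:" + day + ":" + tag);
        }
    }

    private void increment(String key) {
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Read path: a single keyed lookup, no scan over the stored documents.
    long get(String key) {
        LongAdder adder = counters.get(key);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        InsertTimeTagCounters c = new InsertTimeTagCounters();
        c.recordTags("2024-01-01", Arrays.asList("hbase", "scan"));
        c.recordTags("2024-01-02", Arrays.asList("hbase"));

        System.out.println(c.get("global:hbase"));           // prints 2
        System.out.println(c.get("daily:2024-01-01:hbase")); // prints 1
    }
}
```

The trade-off is that each write does a little more work (one increment per counter you want to keep), while each read becomes a constant-cost lookup regardless of how many documents have been stored.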
