
Hbase scan vs Mapreduce for on the fly computation

I need to compute an aggregate over an HBase table.

Say I have this HBase table: 'metadata', with column family M and column n.

Here the metadata object has a list of strings:

class metadata {
    List<String> tags;
}

I need to compute the count of tags, for which I was thinking of using either MapReduce or a direct scan over HBase.

The result has to be returned on the fly. So which one can I use in this scenario: scan over HBase and compute the aggregate, or MapReduce?

MapReduce is ultimately going to scan HBase and compute the count anyway.
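For reference, the direct-scan approach is just the plain HBase client API. Below is a minimal sketch; since the question doesn't say how the metadata object is serialized, it assumes the tags list is stored in the M:n cell as a comma-separated string:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagCountScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metadata"))) {
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("M"), Bytes.toBytes("n")); // only fetch the tags cell
            scan.setCaching(1000); // batch rows per RPC to speed up the full scan
            long tags = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("M"), Bytes.toBytes("n"));
                    if (value != null && value.length > 0) {
                        tags += Bytes.toString(value).split(",").length; // assumed serialization
                    }
                }
            }
            System.out.println("total tags: " + tags);
        }
    }
}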

What are the pros and cons of using either of these?

I suspect you're not aware of the pros and cons of HBase: it's not suited for computing real-time aggregations of large datasets.

Let's start by saying that MapReduce is a scheduled job by itself; you won't be able to return the response on the fly. Expect no less than 15 seconds for the Task Tracker to initialize the job.

In the end, the MapReduce job will do exactly the same thing: an HBase scan. The difference between performing the scan right away and MapReduce is just the parallelization and data locality, which excel when you have millions or billions of rows. If your query only needs to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do that at query time.
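To make the comparison concrete, a MapReduce version of the same count could look roughly like the sketch below, using TableMapReduceUtil and a job counter (same comma-separated-tags assumption as before). It parallelizes the scan across region splits, but pays the job-startup latency on every query:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TagCountJob {
    public enum Counters { TAGS }

    static class TagMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx) {
            byte[] value = row.getValue(Bytes.toBytes("M"), Bytes.toBytes("n"));
            if (value != null && value.length > 0) {
                // each mapper counts the tags of the rows in its region split
                ctx.getCounter(Counters.TAGS).increment(Bytes.toString(value).split(",").length);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "tag-count");
        job.setJarByClass(TagCountJob.class);
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("M"), Bytes.toBytes("n"));
        scan.setCaching(500);
        scan.setCacheBlocks(false); // don't pollute the region servers' block cache
        TableMapReduceUtil.initTableMapperJob("metadata", scan, TagMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0); // map-only: the aggregate lives in the job counter
        job.setOutputFormatClass(NullOutputFormat.class);
        boolean ok = job.waitForCompletion(true);
        if (ok) {
            long total = job.getCounters().findCounter(Counters.TAGS).getValue();
            System.out.println("total tags: " + total);
        }
        System.exit(ok ? 0 : 1);
    }
}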

HBase is best suited for handling tons of atomic reads and writes; that way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you'll need or how many requests you're going to receive: with a proper row key design and split policy you can scale to satisfy the demand.
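In code, that means doing atomic increments on the write path so that reads become point lookups instead of scans. A minimal sketch, assuming a hypothetical tag_counts table keyed by tag:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagCounters {
    // called on the write path: bump the tag's counter atomically,
    // so the aggregate later is a single Get instead of a full scan
    static void recordTag(Connection conn, String tag) throws IOException {
        try (Table counters = conn.getTable(TableName.valueOf("tag_counts"))) { // hypothetical table
            counters.incrementColumnValue(
                    Bytes.toBytes(tag),                         // row key: the tag itself
                    Bytes.toBytes("M"), Bytes.toBytes("count"), // column M:count
                    1L);
        }
    }
}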

Think of it as a word count: you could store all the words in a list and count them at query time when requested, or you can process that list at insert time and store the number of times each word is used in the document as a global counter, and in daily, monthly, yearly, per-country, and per-author tables (or even column families).
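As an illustration of the insert-time variant, the sketch below bumps a global counter plus daily and monthly buckets for each word (the word_counts table, the c family, and the row-key layout are all assumptions; per-country or per-author buckets work the same way):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCounters {
    static final byte[] F = Bytes.toBytes("c"); // hypothetical counter column family

    // at insert time, bump a global counter plus daily and monthly buckets;
    // "word count at query time" then becomes a handful of point reads
    static void recordWord(Connection conn, String word, String day) throws IOException {
        String month = day.substring(0, 7); // e.g. "2015-06" from "2015-06-01"
        try (Table t = conn.getTable(TableName.valueOf("word_counts"))) { // hypothetical table
            t.incrementColumnValue(Bytes.toBytes(word), F, Bytes.toBytes("global"), 1L);
            t.incrementColumnValue(Bytes.toBytes(word + "|" + day), F, Bytes.toBytes("daily"), 1L);
            t.incrementColumnValue(Bytes.toBytes(word + "|" + month), F, Bytes.toBytes("monthly"), 1L);
        }
    }
}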
