
HBase Mapreduce on multiple scan objects

I am just trying to evaluate HBase for some of the data analysis we are doing.

HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.

The problem with running a mapreduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons we want to scan the data for only the 4-5 event types in a given date range, not all 1000 event types. If we use the method below then I guess we don't have that choice, because it takes only one Scan object.

  public static void initTableMapperJob(String table, Scan scan,
      Class mapper, Class outputKeyClass, Class outputValueClass,
      org.apache.hadoop.mapreduce.Job job) throws IOException

Is it possible to run mapreduce on a list of Scan objects? Any workaround?

Thanks

TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.

It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table (the easiest way would be to call TableInputFormatBase.scan.setStartRow() each time). Aggregate the InputSplit instances returned into a single list, along the lines of the sketch below.
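A minimal sketch of that idea, with the caveats that the segment list and its setter are hypothetical stand-ins (a real implementation would read the start/stop rows from the job configuration) and that a working subclass also needs the table wiring that TableInputFormat normally performs in setConf:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;

  public class MultiSegmentTableInputFormat extends TableInputFormatBase {

      // One {startRow, stopRow} pair per segment, e.g. eventId + date range.
      // Hypothetical: a real job would load these from the configuration.
      private byte[][][] segments;

      public void setSegments(byte[][][] segments) {
          this.segments = segments;
      }

      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
          List<InputSplit> splits = new ArrayList<InputSplit>();
          Scan scan = getScan();                // the single Scan set on the job
          for (byte[][] segment : segments) {
              scan.setStartRow(segment[0]);     // narrow the scan to this segment
              scan.setStopRow(segment[1]);
              setScan(scan);
              splits.addAll(super.getSplits(context)); // splits for one segment
          }
          return splits;
      }
  }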

Then configure the job yourself to use your custom MultiSegmentTableInputFormat.

I've tried Dave L's approach and it works beautifully.

To configure the map job, you can use the function

  TableMapReduceUtil.initTableMapperJob(byte[] table, Scan scan,
  Class<? extends TableMapper> mapper,
  Class<? extends WritableComparable> outputKeyClass,
  Class<? extends Writable> outputValueClass, Job job,
  boolean addDependencyJars, Class<? extends InputFormat> inputFormatClass)

where inputFormatClass refers to the MultiSegmentTableInputFormat mentioned in Dave L's comments.
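For illustration, a call wiring it all together might look like this; the table name "events", the EventMapper class, and the output key/value classes are placeholders, and the Job construction details vary with the Hadoop version:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  Configuration conf = HBaseConfiguration.create();
  Job job = Job.getInstance(conf, "event-analysis");

  Scan scan = new Scan();  // base scan; per-segment rows are set by the input format
  TableMapReduceUtil.initTableMapperJob(
          Bytes.toBytes("events"),             // table name (placeholder)
          scan,
          EventMapper.class,                   // your TableMapper subclass (placeholder)
          Text.class,                          // mapper output key class
          IntWritable.class,                   // mapper output value class
          job,
          true,                                // addDependencyJars
          MultiSegmentTableInputFormat.class); // the custom input format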

You are looking for the class:

org/apache/hadoop/hbase/filter/FilterList.java

Each scan can take a filter, and a filter can be quite complex. The FilterList allows you to specify multiple individual filters and then AND or OR all of the component filters together. You can use this to build up an arbitrary boolean query over the rows.
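As a quick illustration, assuming the row keys begin with the eventId, a handful of event types (the names here are made up) could be OR'ed together like this:

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.FilterList;
  import org.apache.hadoop.hbase.filter.PrefixFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  // OR together one PrefixFilter per event type of interest.
  FilterList eventTypes = new FilterList(FilterList.Operator.MUST_PASS_ONE);
  eventTypes.addFilter(new PrefixFilter(Bytes.toBytes("login")));
  eventTypes.addFilter(new PrefixFilter(Bytes.toBytes("purchase")));
  eventTypes.addFilter(new PrefixFilter(Bytes.toBytes("checkout")));

  Scan scan = new Scan();
  scan.setFilter(eventTypes);

One thing to be aware of: unlike narrowing the scan with start/stop rows, a filter by itself is applied server-side to every row in the scanned range, so non-matching rows are read and discarded rather than skipped; combining the FilterList with sensible start/stop rows keeps the scan cheap.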
