

HBase bulk load spawns a high number of reducer tasks - any workaround?

HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job will spawn a few hundred reducer tasks. This can get very slow on a small cluster.
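For context, here is a rough sketch of the kind of job setup being described. The mapper, table name, and paths are placeholders, and it uses the old-style HFileOutputFormat/HTable API from around the time of this question:

    // Rough sketch only -- MyHFileMapper, "my_table", and the paths are placeholders.
    // configureIncrementalLoad() sets the partitioner, total ordering, and one
    // reducer task per region of the target table.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadJob {

      // Placeholder mapper: parses "rowkey<TAB>value" lines into KeyValues.
      static class MyHFileMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          byte[] row = Bytes.toBytes(parts[0]);
          KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("col"),
              Bytes.toBytes(parts[1]));
          ctx.write(new ImmutableBytesWritable(row), kv);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoadJob.class);
        job.setMapperClass(MyHFileMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This call configures one reducer task per region of the target table.
        HTable table = new HTable(conf, "my_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }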

Is there any workaround possible, for example by using MultipleOutputFormat or something else?

Thanks

  1. Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online, and you can also determine when a region has been load balanced to another server. I wouldn't be so quick to go to a coarser granularity.
  2. Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.
  3. It's very easy to get network bottlenecked. Make sure you're compressing your HFiles and your intermediate MR data:

      // Compress intermediate map output (Gzip) and the generated HFiles (LZO).
      job.getConfiguration().setBoolean("mapred.compress.map.output", true);
      job.getConfiguration().setClass("mapred.map.output.compression.codec",
          org.apache.hadoop.io.compress.GzipCodec.class,
          org.apache.hadoop.io.compress.CompressionCodec.class);
      job.getConfiguration().set("hfile.compression", Compression.Algorithm.LZO.getName());
  4. Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable.put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job) (a rough sketch follows this list).
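A rough sketch of point 4: a map-only job that writes Puts through the normal client API instead of producing HFiles, so no reduce stage is scheduled at all. The class name PutImportMapper and the table name "my_table" are placeholders:

    // Rough sketch of a Put-based, map-only import (point 4). Names are placeholders.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class PutImportJob {

      // Placeholder mapper: parses "rowkey<TAB>value" lines and emits one Put per line.
      static class PutImportMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t", 2);
          byte[] row = Bytes.toBytes(parts[0]);
          Put put = new Put(row);
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
          ctx.write(new ImmutableBytesWritable(row), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "put-based-import");
        job.setJarByClass(PutImportJob.class);
        job.setMapperClass(PutImportMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Wires up TableOutputFormat for the target table; null means no reducer class.
        TableMapReduceUtil.initTableReducerJob("my_table", null, job);
        // Map-only: Puts go straight to the region servers, no reduce stage.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }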

When we use HFileOutputFormat, it overrides the number of reducers no matter what you set: the number of reducers is made equal to the number of regions in the HBase table. So decrease the number of regions if you want to control the number of reducers.
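For illustration, one way to control the region count (and therefore the reducer count) up front is to create the table pre-split into a fixed number of regions. A rough sketch, with table name, column family, and row-key range as placeholders:

    // Rough sketch: pre-split the target table into a small, fixed number of
    // regions, since the bulk-load job gets one reducer per region.
    // Table name, column family, and the row-key range are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePresplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("my_table");
        desc.addFamily(new HColumnDescriptor("cf"));

        // 10 regions here means configureIncrementalLoad will set up 10 reducers.
        admin.createTable(desc, Bytes.toBytes("row0000000"), Bytes.toBytes("row9999999"), 10);
      }
    }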

You will find sample code here:

Hope this will be useful :)
