HBase bulk load spawns high number of reducer tasks - any workaround
HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job spawns a few hundred reducer tasks. This can get very slow on a small cluster.
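For reference, a minimal sketch of the bulk-load setup being described (the table name "my_table" is a placeholder; this fragment belongs inside normal job-setup code against the old Hadoop/HBase API):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "hbase-bulk-load");
HTable table = new HTable(job.getConfiguration(), "my_table"); // placeholder name
// configureIncrementalLoad inspects the table's region boundaries and calls
// job.setNumReduceTasks(<region count>), so any reducer count set earlier
// on the job is overridden.
HFileOutputFormat.configureIncrementalLoad(job, table);
```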
Is there any workaround, for example using MultipleOutputFormat or something else?

Thanks
It's very easy to get network-bottlenecked. Make sure you're compressing your HFiles and your intermediate MR data.
```java
job.getConfiguration().setBoolean("mapred.compress.map.output", true);
job.getConfiguration().setClass("mapred.map.output.compression.codec",
    org.apache.hadoop.io.compress.GzipCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);
job.getConfiguration().set("hfile.compression",
    Compression.Algorithm.LZO.getName());
```
Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable.Put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).
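A sketch of what that Put-based setup might look like (the mapper class MyPutMapper and the table name are hypothetical; passing null as the reducer class together with zero reduce tasks sends the mapper's Puts straight to the table through TableOutputFormat):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "put-based-import");
job.setMapperClass(MyPutMapper.class); // hypothetical mapper emitting Put objects
// null reducer: initTableReducerJob wires up TableOutputFormat for the table
// without forcing a reduce phase; zero reduce tasks then skips it entirely.
TableMapReduceUtil.initTableReducerJob("my_table", null, job); // placeholder table
job.setNumReduceTasks(0);
```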
When we use HFileOutputFormat, it overrides the number of reducers regardless of what you set: the number of reducers is made equal to the number of regions in that HBase table. So decrease the number of regions if you want to control the number of reducers.
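One way to control the region count up front is to create (or re-create) the table pre-split into a chosen number of regions; a sketch using the old HBaseAdmin API (table name, column family, and key range are examples):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("my_table"); // example name
desc.addFamily(new HColumnDescriptor("cf"));              // example family
// Pre-split into 20 regions across the expected row-key range, so the
// bulk-load job later runs with 20 reducers instead of hundreds.
admin.createTable(desc, Bytes.toBytes("a"), Bytes.toBytes("z"), 20);
```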
You will find sample code here:
Hope this will be useful :)