Does it make sense to run an hbase mapreduce job with a million Scans?
I have a dataset in hbase which is large enough that it takes a couple of hours to run a mapreduce job on the entire dataset. I'd like to be able to break the data down using precomputed indexes: once a day, map the entire data set and break it down into multiple indexes:
My thought was to just store a list of row IDs for the relevant records, so that later people can run small mapreduce jobs on just those rows. But a 1% sample is still 1M rows of data, and I'm not sure how to construct a mapreduce job over a list of a million rows.
Does it make any sense to create a table mapper job using initTableMapperJob(List scans) if the query is going to be made up of a million different Scan objects? Are there other ways to do this so that I can still farm the computation and I/O out to the hbase cluster efficiently?
Don't do a million scans. If you have a million non-contiguous ids, you could run a map/reduce job over the list of ids using a custom input format, so that you divide the list up into a reasonable number of partitions (I would guess 4x the number of your m/r slots, but that number is not based on anything). That would give you a million get operations, which is probably better than a million scans.
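The partitioning step can be sketched independently of Hadoop: split the id list into roughly equal chunks, where each chunk would back one input split whose mapper issues a Get per id instead of a Scan. A minimal sketch (the `num_partitions` value of 48, i.e. 4x a hypothetical 12 m/r slots, is only an illustration):

```python
def partition_ids(row_ids, num_partitions):
    """Split a list of row ids into roughly equal chunks.

    Each chunk would become one input split; the mapper for that
    split would issue a Get per id rather than a Scan.
    """
    chunk = -(-len(row_ids) // num_partitions)  # ceiling division
    return [row_ids[i:i + chunk] for i in range(0, len(row_ids), chunk)]

# A million synthetic row ids, split 48 ways (e.g. 4x of 12 slots).
ids = [f"row-{i:07d}" for i in range(1_000_000)]
parts = partition_ids(ids, 48)
print(len(parts), len(parts[0]))  # → 48 20834
```

A real implementation would put this logic in a custom InputFormat's getSplits(), with each split carrying its chunk of ids to the mapper.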
If you are lucky enough to have a more reasonable number of contiguous ranges, then scans would be better than straight gets.
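One way to find out whether you have contiguous ranges at all is to sort the ids and coalesce adjacent ones; each resulting range could then drive one Scan (start row / stop row) instead of many Gets. A minimal sketch, assuming integer row keys purely for illustration:

```python
def coalesce(ids):
    """Collapse integer ids into (start, end) ranges, end exclusive.

    Each range could back one Scan (setStartRow/setStopRow);
    singleton ranges are better served by Gets.
    """
    ranges = []
    for i in sorted(ids):
        if ranges and ranges[-1][1] == i:
            ranges[-1][1] = i + 1          # extend the current range
        else:
            ranges.append([i, i + 1])      # start a new range
    return [tuple(r) for r in ranges]

print(coalesce([1, 2, 3, 7, 8, 20]))  # → [(1, 4), (7, 9), (20, 21)]
```

If the output is dominated by long ranges, scans win; if it is mostly singletons, stick with the partitioned-gets approach above.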