
How can I limit an HBase scan to only the relevant (unfiltered) regions for a MapReduce job?

Original 2019-08-08 16:37:43 6 1 java/ mapreduce/ hbase

I am running a MapReduce job to export data from HBase to HDFS. Multiple filters are applied to the scan.

The scan cannot be limited by row key, because the row key does not contain the required information.

When the MR job runs, one mapper is created for each region of the HBase table. Some of those regions contain only data that the filters reject, so their mappers receive nothing to read and are terminated after a timeout. The volume of data to be extracted is significantly less than the total amount of data, so the job eventually fails because of the large number of mappers being terminated.

The answers I am not looking for:

Implementing "manual" filtering within the mapper.
Increasing the timeout interval.

What I am looking for is one of these:

A link to an article about how this problem is solved.
An efficient solution or an idea (not necessarily with code) that does not involve running the full HBase table through mappers, or at least (let's be real) reduces the compute load within the mappers.
A confirmation that there is no efficient way of doing this, as I've spent a fair amount of time looking for one.

I believe a code sample is not necessary, as anyone who understands HBase will know what I am asking for.

Thanks in advance.

1 Answer

To solve this problem, I created an MR job.

The mapper classified each row key into one of the categories and picked the first and last element for each category (because everything is sorted within a region). To pick the last element, I kept a single object and overwrote it with each value that reached the mapper. Then I wrote both values to the context in the cleanup phase (classifier_name as the key and row_key as the value).
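
The mapper-side bookkeeping described above could look roughly like the sketch below. It is plain Java rather than an actual HBase `TableMapper`, so it runs without a cluster; the class name `FirstLastTracker`, the use of `String` row keys, and the idea that the category is computed elsewhere are all assumptions made for illustration. `accept` stands in for `map()`, and `bounds()` returns what would be written to the context in `cleanup()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: keeps the first and last row key seen per category in one mapper. */
final class FirstLastTracker {
    // category -> {first row key, last row key}
    private final Map<String, String[]> bounds = new LinkedHashMap<>();

    /** Plays the role of map(): called once per row, in row-key order. */
    void accept(String category, String rowKey) {
        String[] pair = bounds.get(category);
        if (pair == null) {
            // First row of this category: it is both first and last so far.
            bounds.put(category, new String[] {rowKey, rowKey});
        } else {
            // Rows arrive sorted within a region, so the latest key is the last one.
            pair[1] = rowKey;
        }
    }

    /** Plays the role of cleanup(): at most two keys per category to emit. */
    Map<String, String[]> bounds() {
        return bounds;
    }
}
```

Because only the "last" slot is overwritten on each call, memory use stays proportional to the number of categories, not to the number of rows in the region.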

The mapper outputs were light (number of categories * 2), so I set the number of reducers to 1 and wrote some basic logic that keeps an object with low_row/high_row and updates it on the fly, so I did not have to sort anything at the end. The final output was of the form:
classifier_name, start_rowKey, end_rowKey
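
A minimal sketch of that single-reducer step, again in plain Java so it can run standalone. `RangeMerger` and its methods are hypothetical names, and row keys are modelled as `String`s compared with `compareTo`, which matches HBase's byte-wise ordering only for ASCII keys; a real job would compare the raw `byte[]` keys instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: merges per-mapper first/last keys into one range per category. */
final class RangeMerger {
    // category -> {low row key, high row key}
    private final Map<String, String[]> ranges = new TreeMap<>();

    /** Called for every (classifier_name, row_key) pair emitted by the mappers. */
    void accept(String category, String rowKey) {
        String[] range = ranges.get(category);
        if (range == null) {
            ranges.put(category, new String[] {rowKey, rowKey});
            return;
        }
        // Update the running minimum and maximum; no sorting is needed.
        if (rowKey.compareTo(range[0]) < 0) range[0] = rowKey;
        if (rowKey.compareTo(range[1]) > 0) range[1] = rowKey;
    }

    /** Produces lines of the form: classifier_name, start_rowKey, end_rowKey */
    List<String> lines() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String[]> e : ranges.entrySet()) {
            out.add(e.getKey() + ", " + e.getValue()[0] + ", " + e.getValue()[1]);
        }
        return out;
    }
}
```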

I was then able to use these values to limit my scan.
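
As a rough illustration of this last step, the sketch below turns one output line into scan bounds. `ScanBounds` and `fromLine` are hypothetical names, and the sketch assumes UTF-8 row keys that contain no commas. Since an HBase stop row is exclusive by default, a zero byte is appended to the end key so that the last row is still included. In an actual job the two arrays would presumably be handed to `Scan.withStartRow` and `Scan.withStopRow` (or the older `setStartRow`/`setStopRow`) before the scan is passed to `TableMapReduceUtil.initTableMapperJob`, which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: scan bounds derived from "classifier_name, start_rowKey, end_rowKey". */
final class ScanBounds {
    final String classifier;
    final byte[] startRow; // inclusive
    final byte[] stopRow;  // exclusive

    private ScanBounds(String classifier, byte[] startRow, byte[] stopRow) {
        this.classifier = classifier;
        this.startRow = startRow;
        this.stopRow = stopRow;
    }

    static ScanBounds fromLine(String line) {
        String[] parts = line.trim().split(",\\s*");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        byte[] start = parts[1].getBytes(StandardCharsets.UTF_8);
        byte[] end = parts[2].getBytes(StandardCharsets.UTF_8);
        // end + 0x00 is the smallest key strictly greater than end in byte order,
        // so using it as an exclusive stop row keeps the end row in the scan.
        byte[] stop = Arrays.copyOf(end, end.length + 1);
        return new ScanBounds(parts[0], start, stop);
    }
}
```

Rows of other categories can still fall between a category's first and last key, so the original filters would most likely need to stay on the scan; the bounds only let the job skip regions that lie entirely outside the range.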

Hope that will help someone :)