HBase Scan - RowKey Filter

Question

So, let's see if I can explain my issue briefly.

Imagine we have an HBase table that records every visit to a disco: each entry registers the disco's name, the visitor's name, and the day of the visit. (Yes, it's a dumb example, I know.)

So, for example, these would be some values of the table:

..
ministryOfSoundJamesOliver01022017
ministryOfSoundJamesOliver02022017
ministryOfSoundJamesOliver03022017
ministryOfSoundOliviaNewton04042017
ministryOfSoundOliviaNewton06042017
...
pachaibizaJohnMcKiness06042017
pachaibizaJohnMcKiness04042017
pachaibizaWilliamForrester04042017
..

The RowKey has the following structure:

discoName

personName

dayOfTheYear

(The table has some other columns/qualifiers, but they don't matter for this question.)

The issue is: imagine a boy who simply loves going to Ministry Of Sound. He just loves it, and he spends all his money on disco and drugs (but that's not the point here).

My goal is to output every person who attended Ministry Of Sound. In my scan, this dude keeps appearing in the results, so I must discard a lot of entries in search of the next visitor. For example:

..
ministryOfSoundJohnnyYonkie01022017
ministryOfSoundJohnnyYonkie02022017
ministryOfSoundJohnnyYonkie03022017
ministryOfSoundJohnnyYonkie04022017
ministryOfSoundJohnnyYonkie05022017
ministryOfSoundAnotherDude02022017
...

In order to register AnotherDude, I must discard 4 entries from Johnny.

Finally, the question is:

Is there any way to tell HBase that repeated entries sharing the same bytes from byte(x) to byte(x+y) [x being the number of bytes in discoName and y the number of bytes in personName] must be discarded automatically?

Thanks a lot in advance!

Answer 1

First things first: if you only have client access, I can't help you :(

If you have additional access, then you could look at the following suggestions, but the default reply would be: if this is your access pattern, optimize your schema for it.

If you need to access data in a certain way, make sure you write it that way in the first place. Use the MapReduce API if you have to perform migrations.

I would probably simply add a table that holds one row per disco, such as ministryOfSound, with one column per visitor. (In general, the schema you propose doesn't sound well suited to HBase: if post-processing the duplicate results away really is a performance issue, then you have a lot of writes with monotonically increasing rowkeys.)
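To make the extra-table idea concrete, here is a minimal sketch in plain Java with no HBase dependency. It only simulates the layout: a sorted map stands in for the table, the disco name is the row, and each visitor is a qualifier. The class and method names (`VisitorIndexSketch`, `recordVisit`, `visitorsOf`) are made up for illustration, and storing the visit day as the cell value is an assumption.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simulation only: a sorted map stands in for an HBase table.
final class VisitorIndexSketch {
    // row (disco name) -> qualifier (visitor name) -> value (most recently written visit day)
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Repeated visits by the same person overwrite the same cell,
    // so each visitor appears at most once per disco row.
    void recordVisit(String disco, String person, String day) {
        rows.computeIfAbsent(disco, d -> new TreeMap<>()).put(person, day);
    }

    // Reading a single row yields every distinct visitor, in sorted order.
    Set<String> visitorsOf(String disco) {
        TreeMap<String, String> row = rows.get(disco);
        return row == null
                ? Collections.emptySet()
                : Collections.unmodifiableSet(row.keySet());
    }
}
```

With a layout like this, listing the visitors of ministryOfSound becomes a single-row read instead of a scan that throws duplicates away.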

On the other hand, if this is an ad-hoc query, then you probably want to use the MapReduce API straight away, maybe using the Apache Spark interconnect, and perform a "distinct" call on the data.

Using Scans for analytical queries isn't how I would do it.

If you had to do it using Scans, then I would recommend you implement a CoProcessor. These can augment a Filter with state, and you can project the results of a PrefixFilter'd Scan on the Region Server side. If you're new to CoProcessors, here's an introduction: HBase: The Definitive Guide. This requires that you can deploy jars into the RegionServer classpath.

But again, if you blow up your client by doing distinct filtering there, you're probably also blowing up your regions due to hotspots on the inserts.

As a final alternative: you might want to look at Apache Phoenix and see if you can coerce your rowkey into a schema from which you can do a SELECT DISTINCT on the first two parts of the rowkey. This would obviously require that you have a delimiter in your rowkey, or at least fixed-length parts.
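The stateful part of such a server-side component boils down to "after seeing one row for a visitor, jump past the rest of that visitor's rows". Below is a rough sketch of that skip-ahead logic in plain Java, not actual CoProcessor or Filter code. A `NavigableSet` stands in for the sorted rowkeys and `ceiling` stands in for a seek. It assumes a `|` delimiter between the rowkey parts, which the original keys do not have, and well-formed ASCII keys. All names (`DistinctVisitorSketch`, `nextAfterPrefix`, `distinctVisitors`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simulation only: no HBase classes are used here.
final class DistinctVisitorSketch {
    // Hypothetical delimiter; the rowkeys in the question have none.
    static final char SEP = '|';

    // Returns the smallest string that sorts after every key starting with `prefix`.
    // Assumes a non-empty prefix whose last character is below Character.MAX_VALUE.
    static String nextAfterPrefix(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Walks the sorted keys of one disco, emitting each visitor once and
    // seeking past all remaining rows of that visitor.
    static List<String> distinctVisitors(NavigableSet<String> sortedKeys, String disco) {
        List<String> visitors = new ArrayList<>();
        String discoPrefix = disco + SEP;
        String key = sortedKeys.ceiling(discoPrefix);
        while (key != null && key.startsWith(discoPrefix)) {
            // Assumes every key has the form disco|person|day.
            int end = key.indexOf(SEP, discoPrefix.length());
            String person = key.substring(discoPrefix.length(), end);
            visitors.add(person);
            // Jump straight to the first key after this visitor's block.
            key = sortedKeys.ceiling(nextAfterPrefix(discoPrefix + person + SEP));
        }
        return visitors;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>(List.of(
                "ministryOfSound|JohnnyYonkie|01022017",
                "ministryOfSound|JohnnyYonkie|02022017",
                "ministryOfSound|AnotherDude|02022017"));
        System.out.println(distinctVisitors(keys, "ministryOfSound"));
    }
}
```

In a real deployment the same jump would more likely be expressed through a seek hint from a custom Filter (HBase has `ReturnCode.SEEK_NEXT_USING_HINT` for that), but the exact wiring depends on the HBase version and is not shown here.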

Answer 1
2 Accepted 2017-03-16 11:01:38
