
Pig HBaseStorage customization

How can I customize HBaseStorage for a Pig script? I want to apply some business logic to the data before it is loaded into the script, something like a custom storage class on top of HBaseStorage.

For example, my row keys have the structure A_B_C. Currently I pass the A_B_C key to HBaseStorage in my Pig script, but I want to apply some logic, such as filtering, against keys like A_B_C_D before the input data reaches the actual Pig script. How can I do this?

You may end up having to look at the HBaseStorage Java class and implement your own classes based on it. Depending on how HBaseStorage and its associated classes are written, this could range from easy (just extend HBaseStorage itself and override where necessary) to a real headache.
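To make the filtering part concrete, here is a minimal sketch of the kind of row-key check the question describes. It assumes the business logic is "keep keys of the form A_B_C_D whose last segment is in an allowed set". `RowKeyFilter` and its methods are hypothetical names, not part of Pig or HBase. In a subclass of HBaseStorage, one plausible place to call such a check is an overridden `getNext()`, assuming the row key is loaded as the first field of each tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical helper (not a Pig or HBase class): decides whether a row key
 * shaped like A_B_C_D should be passed on to the Pig script.
 */
final class RowKeyFilter {
    private static final String SEPARATOR = "_";

    private RowKeyFilter() {}

    /** Splits a key such as "A_B_C_D" into segments; -1 keeps trailing empty segments. */
    static String[] parts(String rowKey) {
        return rowKey.split(SEPARATOR, -1);
    }

    /**
     * Accepts keys that start with the three-segment prefix (e.g. "A_B_C"),
     * have exactly four segments, and whose fourth segment is in allowedLast.
     */
    static boolean accept(String rowKey, String prefix, Set<String> allowedLast) {
        if (rowKey == null || !rowKey.startsWith(prefix + SEPARATOR)) {
            return false;
        }
        String[] p = parts(rowKey);
        return p.length == 4 && allowedLast.contains(p[3]);
    }

    /** Returns only the accepted keys, preserving their input order. */
    static List<String> filter(List<String> rowKeys, String prefix, Set<String> allowedLast) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            if (accept(key, prefix, allowedLast)) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("D", "E"));
        List<String> keys = Arrays.asList("A_B_C_D", "A_B_C_X", "A_B_C_E", "A_B_C", "Z_B_C_D");
        // An overridden getNext() in a custom loader could apply accept() to each
        // row key and skip the tuples that fail (assumption, not shown here).
        System.out.println(filter(keys, "A_B_C", allowed));
    }
}
```

Keeping the key logic in a plain class like this means it can be unit-tested without a cluster and reused whether the filtering ends up in a custom loader or in a separate job.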

You then have to make sure the .jar containing your code is on the Pig classpath.

I find HBaseStorage to be a real pain, so I write regular Java MapReduce jobs to query HBase and create custom sequence files, which I then read from Pig with a simple custom loader. This saves a ton of time, since the sequence file can be reused many times throughout the day to get quick results, rather than scanning everything in HBase for every Pig script.
