
Is there an alternative to HBaseStorage in Pig?

I am using HBaseStorage with the -caching option in a Pig script as follows:

HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount countDetails:ineffCount countDetails:totalCount', '-caching 1000');

I can see the setting reflected in my job.xml, but it makes no difference to the run time. I am processing 10 million records and storing about 160 MB of data into HBase. Storing the result in HDFS takes 3 minutes, while the same job takes 30 minutes when it stores into HBase.

I even tried setting:

SET hbase.client.scanner.caching 1000;

Please let me know how I can reduce the time. Is there any alternative to HBaseStorage? http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/

The blog post above says that I have to set hbase.client.scanner.caching in a bootstrap script, but I don't know how to do that. Would it be enough to set it in the HBase conf? Please help me out with this.

hbase.client.scanner.caching is the number of rows fetched when calling next on a scanner, if the call cannot be served from (local, client) memory. Because it governs scanner reads, it would not be expected to speed up a job whose slow step is storing into HBase.

Higher caching values make scanners faster but use more memory, and some calls to next may take longer when the cache is empty. Do not set this value so high that the time between invocations exceeds the scanner timeout, i.e. hbase.regionserver.lease.period, which is 1 minute by default. Clients must report in within this period or they are considered dead.
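To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not HBase code: the function names and the per-row processing time are assumptions made for illustration only. It estimates how many next calls have to go to the region server for a given caching value, and whether working through one cached batch stays under the 1 minute lease period.

```python
import math


def scanner_round_trips(total_rows: int, caching: int) -> int:
    """Approximate number of next() calls that must fetch from the server.

    Hypothetical helper: each fetch brings back up to `caching` rows.
    """
    if caching < 1:
        raise ValueError("caching must be at least 1")
    return math.ceil(total_rows / caching)


def batch_fits_lease(caching: int, seconds_per_row: float,
                     lease_seconds: float = 60.0) -> bool:
    """True if the client finishes one cached batch before the lease expires.

    `seconds_per_row` is an assumed client-side processing time per row;
    `lease_seconds` defaults to the 1 minute lease period mentioned above.
    """
    return caching * seconds_per_row < lease_seconds


if __name__ == "__main__":
    rows = 10_000_000  # record count from the question
    for caching in (1, 100, 1000):
        print(caching, scanner_round_trips(rows, caching))
    # Assumed 2 ms of client work per row: a 1000-row batch takes about 2 s.
    print(batch_fits_lease(1000, 0.002))
```

With 10 million rows, raising the caching value from 1 to 1000 cuts the estimated fetches from 10,000,000 to 10,000. That only helps when the job is reading from HBase through a scanner.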

In my experience, HBase doesn't perform very well with Pig. If you don't need random lookups, use HDFS alone; otherwise, an HBase MR job would be a better option. You can also connect to HBase from within a Hadoop MR job (this option gave me the best performance).
