
How to loop over an HBase table in chunks in Python

I am currently writing a Python script that converts HBase tables to CSV using happybase. The problem I am having is that if the table is too big, I get the error below after reading a little over 2 million rows:

Hbase_thrift.IOError: IOError(message='org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x8dfa2f2 closed\n\tat org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1182)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:212)\n\tat org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)\n\tat org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:432)\n\tat org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:358)\n\tat org.apache.hadoop.hbase.client.AbstractClientScanner.next(AbstractClientScanner.java:70)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.scannerGetList(ThriftServerRunner.java:1423)\n\tat sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy10.scannerGetList(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4789)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4773)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat 
org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n')

What I thought of was chopping the for loop into sub-loops (i.e. open the HBase connection -> get the first 100,000 rows -> close the connection -> reopen it -> get the next 100,000 rows -> close it... and so on), but I can't seem to figure out how to do it. Here is a sample of my code that reads all the rows and crashes:

import happybase

connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)

# scan() returns (row_key, data) tuples; this walks the whole table at once
for row in table_object.scan():
    print(row)
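The open -> read a chunk -> close -> reopen idea described above can be sketched as follows. This is a minimal, self-contained sketch of the resume-by-row-key pattern: `FakeTable` is a stand-in that implements only the `scan(row_start=..., limit=...)` subset of the happybase API, so with a real table you would instead have `open_table` create a fresh `happybase.Connection` and return `connection.table(...)` on each call. The chunk size and table contents here are illustrative assumptions.

```python
# Sketch: read a table in fixed-size chunks, reconnecting between chunks.
# FakeTable stands in for a happybase table so the sketch runs anywhere;
# it supports only the scan(row_start=..., limit=...) calls used below.

class FakeTable(object):
    def __init__(self, rows):
        # HBase returns rows in row-key order, so keep them sorted
        self._rows = sorted(rows.items())

    def scan(self, row_start=None, limit=None):
        matched = [(k, v) for k, v in self._rows
                   if row_start is None or k >= row_start]
        for k, v in (matched[:limit] if limit else matched):
            yield k, v


def scan_in_chunks(open_table, chunk_size):
    """Yield every (row_key, data) pair, re-opening the table between chunks.

    open_table: a callable that returns a fresh table object (with a real
    cluster, one that opens a new happybase.Connection each time).
    """
    start = None
    while True:
        table = open_table()  # fresh connection per chunk
        chunk = list(table.scan(row_start=start, limit=chunk_size))
        if not chunk:
            break  # nothing left past the last key: we are done
        for key, data in chunk:
            yield key, data
        # resume just past the last key seen: appending the smallest
        # possible byte makes the start key effectively exclusive
        start = chunk[-1][0] + b'\x00'


rows = {b'r1': {b'f:a': b'1'}, b'r2': {b'f:a': b'2'}, b'r3': {b'f:a': b'3'}}
collected = list(scan_in_chunks(lambda: FakeTable(rows), chunk_size=2))
```

Because each chunk restarts the scan from a remembered row key rather than from the beginning, no rows are repeated even though the connection is recreated between chunks.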

Any help would be appreciated (even if you suggest another solution :))

Thanks

Actually, I figured out a way to do it, and it's as follows:

import happybase

connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)

while True:
    try:
        for row in table_object.scan():
            print(row)
        break  # the scan finished cleanly, so we are done
    except Exception as e:
        if 'org.apache.hadoop.hbase.DoNotRetryIOException' in str(e):
            connection.open()  # the connection was dropped; reopen and retry
        else:
            print(e)
            quit()
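Since the end goal stated in the question is a CSV export, the bytes row keys and values that a happybase scan yields need decoding before they go to the `csv` module. Below is a minimal sketch of that step; the column names, the UTF-8 encoding, and the sample rows are assumptions, and the same function would accept the rows coming out of a real scan.

```python
import csv
import io


def rows_to_csv(rows, columns, out):
    """Write (row_key, data) pairs to the file-like object `out` as CSV.

    rows: iterable of (row_key, {column: value}) with bytes keys/values,
          in the shape a happybase scan yields.
    columns: bytes column names (e.g. b'family:qualifier'), in the
          desired output order; missing cells become empty fields.
    """
    writer = csv.writer(out)
    writer.writerow(['row_key'] + [c.decode('utf-8') for c in columns])
    for key, data in rows:
        writer.writerow([key.decode('utf-8')] +
                        [data.get(c, b'').decode('utf-8') for c in columns])


buf = io.StringIO()
rows_to_csv([(b'r1', {b'f:a': b'1', b'f:b': b'x'}),
             (b'r2', {b'f:a': b'2'})],
            [b'f:a', b'f:b'], buf)
csv_text = buf.getvalue()
```

Writing a header row first and padding missing cells keeps the output loadable by spreadsheet tools even when rows have sparse column sets, which is common in HBase.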
