When scanning a remote HBase table using HappyBase, a 'TSocket read 0 bytes' error happens
I'm trying to scan a remote HBase table which has more than 1,000,000,000 rows. After the scan, I use the scanned rows to build CSV files and put them into HDFS.

I have tried for almost 3 weeks to solve this, but I can't.

This is how I scan the data and make the CSV files:
source of /host/anaconda3/lib/python3.6/site-packages/thriftpy/transport/socket.py
==> I have tried the compat protocol, increasing the network TCP memory buffer, increasing the timeout configuration, setting batch sizes from 1 to 10000 in the scan parameters, etc.
But it works well for almost 30 minutes, and then the error suddenly happens. About 1 time in 50 it finishes well (works without any error). Please help me.

I tried to find the cause of the error, but I can't figure it out. Does anybody know how to solve it?
This is my code:
import sys
print("--sys.version--")
print(sys.version)
from pyhive import hive
import csv
import os
import happybase
import time
import subprocess
import datetime
import chardet
import logging

logging.basicConfig(level=logging.DEBUG)

csv_list = []
col = []

def conn_base():
    print('conn_base starts')
    # SET UP THE CONNECTION AND CONFIGURATION
    conn = happybase.Connection('13.xxx.xxx.xxx', port=9090)
    table = conn.table(b'TEMP_TABLE')

    # ITERATE OVER THE DATA; START A NEW CSV FILE EVERY 500,000 RECORDS
    # AND SLEEP FOR 30 SECONDS EVERY 1,000,000 RECORDS
    print("LET'S MAKE CSV FILES FROM HBASE")
    index = 0
    st = 0
    global csv_list
    for row_key, data in table.scan():
        try:
            if st % 1000000 == 0:
                time.sleep(30)
                print("COUNT: ", st)
            if st % 500000 == 0:
                print("CHANGE CSV FILE")
                index += 1
                ta_na = 'TEMP_TABLE' + str(index) + '_version.csv'
                csv_list.append(ta_na)
            st += 1
            with open('/home/host01/csv_dir/TEMP_TABLE/' + csv_list[index - 1], 'a') as f:
                tmp = []
                tmp.append(data[b'CF1:XXXXX'].decode())
                tmp.append(data[b'CF1:YYYYY'].decode())
                tmp.append(data[b'CF1:DDDDD'].decode())
                tmp.append(data[b'CF1:SSSSS'].decode())
                tmp.append(data[b'CF1:GGGGG'].decode())
                tmp.append(data[b'CF1:HHHHH'].decode())
                tmp.append(data[b'CF1:QQQQQ'].decode())
                tmp.append(data[b'CF1:WWWWWW'].decode())
                tmp.append(data[b'CF1:EEEEE'].decode())
                tmp.append(data[b'CF1:RRRRR'].decode())
                f.write(",".join(tmp) + '\n')
        except Exception:
            pass

    # PUT THE CSV FILES INTO HDFS
    st = 1
    for i in range(len(csv_list)):
        try:
            st += 1
            cmd = "hdfs dfs -put /home/host01/csv_dir/TEMP_TABLE/" + csv_list[i] + " /user/hive/warehouse/TEMP_TABLE/"
            subprocess.call(cmd, shell=True)
            if st % 50 == 0:
                time.sleep(5)
        except Exception:
            pass
    cmd = "hdfs dfs -put /home/host01/csv_dir/TEMP_TABLE/*.csv /user/hive/warehouse/TEMP_TABLE/"
    subprocess.call(cmd, shell=True)
    print("PUT ALL CSV FILES TO HDFS")
    conn.close()
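The file-rotation arithmetic in the loop above (a new CSV file every 500,000 rows, named `TEMP_TABLE<n>_version.csv`) can be isolated into a small helper, which makes it easy to verify independently of HBase. The helper name below is hypothetical, not part of the original code:

```python
def rotated_filename(row_count, rows_per_file=500000, base="TEMP_TABLE"):
    """Return the CSV filename the row with this 0-based count belongs in.

    Mirrors the question's scheme: file 1 holds rows 0..499999,
    file 2 holds rows 500000..999999, and so on.
    """
    index = row_count // rows_per_file + 1
    return f"{base}{index}_version.csv"
```

For example, `rotated_filename(0)` and `rotated_filename(499999)` both map to the first file, while `rotated_filename(500000)` starts the second.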
First make sure the HBase Thrift server is up and running. You can start the Thrift server with the following command:

    hbase-daemon.sh start thrift [ -p 9090 ]

If you want to specify a port number, use -p. The default port is 9090.
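Before debugging the client code, it can help to confirm that the Thrift port is actually reachable from the machine running the script. A minimal connectivity check (host and port here are placeholders for the ones in the question):

```python
import socket

def thrift_port_open(host, port=9090, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, the problem is the server or the network path, not HappyBase.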
You are making it more complicated than it needs to be; looking at the code above, it is just a few simple steps.
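As for the error itself: 'TSocket read 0 bytes' typically means the Thrift connection was dropped mid-scan (for example, a server-side idle timeout). One common workaround is to remember the last row key seen and, on a transport error, reconnect and resume the scan from that key. The sketch below writes the retry logic against a generic `open_scan(row_start)` callable so it can be shown without a live cluster; with HappyBase you would pass something like `lambda start: conn.table(b'TEMP_TABLE').scan(row_start=start, batch_size=1000)` and reopen the connection inside it. All names here are illustrative, not from the original post:

```python
def scan_with_resume(open_scan, max_retries=5, errors=(OSError,)):
    """Yield (row_key, data) pairs, resuming after transient transport errors.

    open_scan(row_start) must return a fresh iterator of (row_key, data) pairs;
    row_start is None on the first attempt, otherwise the last key already
    yielded. Because HBase scans treat row_start as inclusive, the resume row
    comes back once more and is skipped.
    """
    last_key = None
    retries = 0
    while True:
        try:
            skip_first = last_key is not None
            for row_key, data in open_scan(last_key):
                if skip_first:      # drop the duplicate row at the resume point
                    skip_first = False
                    continue
                last_key = row_key
                retries = 0         # progress was made; reset the retry budget
                yield row_key, data
            return                  # scan finished cleanly
        except errors:
            retries += 1
            if retries > max_retries:
                raise
```

The key design point is resuming from `last_key` rather than restarting the whole scan, so a billion-row export does not start over after each dropped socket.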