
When scanning a remote HBase table using Happybase, a 'TSocket read 0 bytes' error happens

I'm trying to scan a remote HBase table that has more than 1,000,000,000 rows. After scanning, I use the scanned rows to make CSV files and put them into HDFS.

I have tried for almost three weeks to solve this, but I can't.

This is how I scan the data and make the CSV files.

Error message:

Source of /host/anaconda3/lib/python3.6/site-packages/thriftpy/transport/socket.py

==> I have tried the compat protocol setting, increasing the network TCP memory buffers, increasing the timeout configuration, setting the batch size in the scan parameters anywhere from 1 to 10,000, and so on.
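For reference, these knobs are typically passed like this in Happybase (a minimal sketch; the host, timeout, and compat values below are placeholders, not a known fix):

    import happybase

    # timeout is in milliseconds; compat must match the HBase Thrift server version.
    conn = happybase.Connection(
        '13.xxx.xxx.xxx',      # Thrift server host (placeholder)
        port=9090,
        timeout=600000,        # 10-minute socket timeout (placeholder value)
        compat='0.98',         # protocol compatibility mode (placeholder value)
    )
    table = conn.table(b'TEMP_TABLE')

    # batch_size controls how many rows each Thrift round trip fetches.
    for row_key, data in table.scan(batch_size=1000):
        pass  # process the row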

With these settings it works well for about 30 minutes, but then the error suddenly happens. Roughly 1 run in 50 finishes without any error. Please help me. I have tried to find the cause of the error, but I can't.

Does anybody know how to solve this?

This is my code:

import sys
print ("--sys.version--")
print (sys.version)
from pyhive import hive
import csv
import os
import happybase
import time
import subprocess
import datetime
import chardet
import logging
logging.basicConfig(level=logging.DEBUG)


csv_list=[]

col=[]
def conn_base():
    print('conn_base starts')


    #SETTING CONNECTION AND CONFIGURATION
    conn=happybase.Connection('13.xxx.xxx.xxx',port=9090)
    table=conn.table(b'TEMP_TABLE')

    #ITERATE DATA AND MAKE CSV FILE PER 100,000 RECORD. AND TAKE A TIME TO SLEEP PER 500000
    tmp=[]
    print('LET\'S MAKE CSV FILE FROM HBASE')
    index=0
    st=0
    global csv_list
    for row_key, data in table.scan():
        try:
            if (st % 1000000 == 0):
                time.sleep(30)
                print("COUNT: ", st)
            if (st % 500000 == 0):
                print("CHANGE CSV FILE")
                index += 1
                ta_na = 'TEMP_TABLE' + str(index) + '_version.csv'
                csv_list.append(ta_na)

            st += 1
            with open('/home/host01/csv_dir/TEMP_TABLE/' + csv_list[index-1], 'a') as f:
                tmp = []
                tmp.append(data[b'CF1:XXXXX'].decode())
                tmp.append(data[b'CF1:YYYYY'].decode())
                tmp.append(data[b'CF1:DDDDD'].decode())
                tmp.append(data[b'CF1:SSSSS'].decode())
                tmp.append(data[b'CF1:GGGGG'].decode())
                tmp.append(data[b'CF1:HHHHH'].decode())
                tmp.append(data[b'CF1:QQQQQ'].decode())
                tmp.append(data[b'CF1:WWWWWW'].decode())
                tmp.append(data[b'CF1:EEEEE'].decode())
                tmp.append(data[b'CF1:RRRRR'].decode())

                f.write(",".join(tmp) + '\n')
                tmp = []

        except:
            pass

    #PUT CSV FILES TO HDFS.
    st = 1
    for i in range(len(csv_list)):
        try:
            st += 1
            cmd = "hdfs dfs -put /home/host01/csv_dir/TEMP_TABLE/" + str(csv_list[i]) + " /user/hive/warehouse/TEMP_TABLE/"
            subprocess.call(cmd, shell=True)
            if (st % 50 == 0):
                time.sleep(5)

        except:
            pass

    cmd = "hdfs dfs -put /home/host01/csv_dir/TEMP_TABLE/*.csv  /user/hive/warehouse/TEMP_TABLE/"
    subprocess.call(cmd, shell=True)

    print("PUT ALL CSV FILES TO HDFS")
    conn.close()

First, make sure the HBase Thrift server is up and running. You can start the Thrift server with the following command:

hbase-daemon.sh start thrift [ -p 9090 ]

If you want to specify a port number, use -p. The default port is 9090.
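As a quick sanity check before starting the long scan, you can confirm the Thrift server is reachable from the client machine (a minimal sketch; the host is a placeholder):

    import happybase

    # Minimal connectivity check against the HBase Thrift server.
    conn = happybase.Connection('13.xxx.xxx.xxx', port=9090)
    print(conn.tables())   # lists table names if the Thrift server responds
    conn.close()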

You are making this more complicated than it needs to be. Looking at the code above, it is just a few simple steps:

  1. Make sure HBase Thrift is up and running (use the command suggested above).
  2. Enable WebHDFS in the HDFS configuration files.
  3. From the hdfs package, use the InsecureClient class (if not Kerberos-authenticated) to write files directly to HDFS (very simple), as sketched below.
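A minimal sketch of step 3, assuming the `hdfs` PyPI package and WebHDFS enabled on the NameNode; the WebHDFS URL, user, output path, and column names are placeholders, not values confirmed by the question:

    import happybase
    from hdfs import InsecureClient

    # WebHDFS endpoint and user are assumptions for illustration.
    client = InsecureClient('http://namenode-host:50070', user='hdfs')
    conn = happybase.Connection('13.xxx.xxx.xxx', port=9090)
    table = conn.table(b'TEMP_TABLE')

    # Stream scanned rows straight into a CSV file in HDFS, with no local
    # files and no `hdfs dfs -put` subprocess calls.
    with client.write('/user/hive/warehouse/TEMP_TABLE/part-0.csv',
                      encoding='utf-8', overwrite=True) as writer:
        for row_key, data in table.scan(batch_size=1000):
            fields = [data.get(b'CF1:XXXXX', b'').decode(),
                      data.get(b'CF1:YYYYY', b'').decode()]
            writer.write(",".join(fields) + '\n')

    conn.close()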
