Broken Pipe error when uploading data to Apache HBase

I'm currently trying to load a large CSV into Apache HBase. The CSV is 50,000 columns wide and 15,000 rows long. The values in the CSV are just integers.

The HBase cluster is running on AWS EMR, with plenty of memory (244GB) and compute (32 cores each, 4 nodes).

I'm trying to load the data into the database with this Python script:

import happybase
import pandas as pd

connection = happybase.Connection('localhost')

families = {
    's': dict(in_memory=True)
}

#connection.delete_table('exon', disable=True)
connection.create_table('exon', families)

table = connection.table('exon')
df = pd.read_csv('exon.csv', nrows=1000)

col = list(df)
col = col[1:]

for index, row in df.iterrows():
    to_put = {}
    for col_name in col:
        to_put[('s:'+ col_name).encode('utf-8')] = str(row[col_name]).encode('utf-8')
    print('putting: ' + str(row[0]))
    table.put(row[0].encode('utf-8'), to_put)

When this script runs and reads only the first few rows, there is no issue:

df = pd.read_csv('exon.csv', nrows=20)

However, reading more rows causes an error:

df = pd.read_csv('exon.csv', nrows=1000)
putting: F1S4_160106_001_B01
Traceback (most recent call last):
  File "load.py", line 25, in <module>
    table.put(row[0].encode('utf-8'), to_put)
  File "/usr/local/lib/python3.6/site-packages/happybase/table.py", line 464, in put
    batch.put(row, data)
  File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 137, in __exit__
  File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 60, in send
    self._table.connection.client.mutateRows(self._table.name, bms, {})
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 200, in _req
    self._send(_api, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 210, in _send
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 153, in write
  File "thriftpy2/protocol/cybin/cybin.pyx", line 477, in cybin.TCyBinaryProtocol.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 474, in cybin.TCyBinaryProtocol.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
  File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
  File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 209, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 71, in cybin.write_i08
  File "thriftpy2/transport/buffered/cybuffered.pyx", line 55, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_write
  File "thriftpy2/transport/buffered/cybuffered.pyx", line 80, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_flush
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/transport/socket.py", line 136, in write
BrokenPipeError: [Errno 32] Broken pipe

Is it just too much data inserted at once? I've tried batch puts as well, and the same issue comes up.

Found my error: because I was calling pandas.read_csv after opening the HappyBase connection, the connection timed out. Calling read_csv before opening the connection remedied the problem.
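
For anyone hitting the same thing, here is a minimal sketch of that ordering: do all the slow parsing first, and only then open the connection and write. It is not the exact script from the question. To keep it dependency-light it swaps pandas.read_csv for the standard csv module, and the helper names (build_row, load_rows, put_rows, upload) and the batch_size default are made up for illustration. It assumes, as in the question, that the first CSV column holds the row key and that the table and its 's' column family already exist.

```python
import csv


def build_row(header, record, family="s"):
    """Turn one CSV record into (row_key, {b'family:column': b'value'})."""
    row_key = record[0].encode("utf-8")
    cells = {
        f"{family}:{name}".encode("utf-8"): value.encode("utf-8")
        for name, value in zip(header[1:], record[1:])
    }
    return row_key, cells


def load_rows(csv_path, limit=None):
    """Parse the CSV completely BEFORE any HBase connection exists."""
    rows = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i, record in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(build_row(header, record))
    return rows


def put_rows(table, rows, batch_size=100_000):
    """Write prepared rows through a HappyBase batch.

    As far as I know, batch_size counts queued cell mutations rather
    than rows, so one 50,000-column row already queues 50,000 of them.
    """
    with table.batch(batch_size=batch_size) as batch:
        for row_key, cells in rows:
            batch.put(row_key, cells)


def upload(rows, host="localhost", table_name="exon"):
    """Open the connection only once the data is ready, then write."""
    import happybase  # deferred so the parsing helpers work without it

    connection = happybase.Connection(host)
    try:
        put_rows(connection.table(table_name), rows)
    finally:
        connection.close()
```

Usage would be something like rows = load_rows('exon.csv', limit=1000) followed by upload(rows). The point is only that nothing slow happens between happybase.Connection(...) and the first put, so the connection has no chance to sit idle.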
