
Unable to upload PDF files larger than 10 MB into HBase via Python happybase (HDP 3)

We are using HDP 3. We are trying to insert PDF files into one of the columns of a particular column family in an HBase table. The development environment is Python 3.6, and the HBase connector is happybase 1.1.0.

We are unable to upload any PDF file larger than 10 MB into HBase.
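For context, the insert is essentially a single put of the whole file into one cell, roughly like this (the host, table, and column names below are placeholders, not our real ones):

import happybase

# connect to the HBase Thrift server (hostname is a placeholder)
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('documents')  # placeholder table name

with open('report.pdf', 'rb') as f:
    pdf_bytes = f.read()

# the whole PDF goes into a single cell of one column family;
# this put fails once the file is larger than ~10 MB
table.put(b'row-1', {b'cf:pdf': pdf_bytes})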

In HBase we have set the parameters as follows:

[screenshots: HBase configuration parameters]

We get the following error:

IOError(message=b'org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Cell with size 80941994 exceeds limit of 10485760 bytes
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkCellSizeLimit(RSRpcServices.java:937)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1010)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:959)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:922)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2683)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42014)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
	at ...

You have to check the HBase source code to see what is happening:

private void checkCellSizeLimit(final HRegion r, final Mutation m) throws IOException {
  if (r.maxCellSize > 0) {
    CellScanner cells = m.cellScanner();
    while (cells.advance()) {
      int size = PrivateCellUtil.estimatedSerializedSizeOf(cells.current());
      if (size > r.maxCellSize) {
        String msg = "Cell with size " + size + " exceeds limit of " + r.maxCellSize + " bytes";
        if (LOG.isDebugEnabled()) {
          LOG.debug(msg);
        }
        throw new DoNotRetryIOException(msg);
      }
    }
  }
}

Based on the error message, you are exceeding r.maxCellSize.

Note on the above: the function PrivateCellUtil.estimatedSerializedSizeOf is deprecated and will be removed in future versions.

Here is its description:

Estimate based on keyvalue's serialization format in the RPC layer. Note that there is an extra SIZEOF_INT added to the size here that indicates the actual length of the cell for cases where cells are serialized in a contiguous format (for example in RPCs).
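So the reported size is essentially the value length plus the key components plus that extra 4-byte prefix. A rough back-of-the-envelope for the number in the error above (all figures here are illustrative, not exact):

value_length = 80_941_900   # approximate size of the PDF bytes (illustrative)
key_overhead = 90           # row key, family, qualifier, timestamp, type (illustrative)
SIZEOF_INT = 4              # extra length prefix for contiguous (RPC) serialization
print(value_length + key_overhead + SIZEOF_INT)  # 80941994, far above the 10485760 limit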

You have to check where the value is set. First check the "ordinary" values in HRegion.java:

this.maxCellSize = conf.getLong(HBASE_MAX_CELL_SIZE_KEY, DEFAULT_MAX_CELL_SIZE);

So there is probably a HBASE_MAX_CELL_SIZE_KEY and a DEFAULT_MAX_CELL_SIZE limit somewhere:

public static final String HBASE_MAX_CELL_SIZE_KEY = "hbase.server.keyvalue.maxsize";
public static final int DEFAULT_MAX_CELL_SIZE = 10485760;

Here you have the 10485760 limit that shows up in your error message. If you need to, you can try raising this limit to the value you require. I recommend testing it properly before going live with it (the limit probably has some reason behind it).

Edit: adding information about how to change the value of hbase.server.keyvalue.maxsize. Check the config files:

There you can read:

hbase.client.keyvalue.maxsize

Description: Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split, it helps to avoid that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check. Default:

10485760

hbase.server.keyvalue.maxsize

Description: Maximum allowed size of an individual cell, inclusive of value and all key components. A value of 0 or less disables the check. The default value is 10MB. This is a safety setting to protect the server from OOM situations. Default:

10485760
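To actually raise the limit, you would set something like the following in hbase-site.xml on the server side (on HDP this is typically done through Ambari rather than by editing the file directly; the 100 MB value below is just an example, and the affected services need a restart afterwards):

<property>
  <name>hbase.server.keyvalue.maxsize</name>
  <value>104857600</value> <!-- example: 100 MB -->
</property>
<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <value>104857600</value> <!-- keep the client-side limit in step -->
</property>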

