Getting rid of error-generating documents from a corpus

Question

I have a set of 1000 documents, encoded and compressed, in an lsm-db stored on my computer. When I try to decompress and decode them, I get an error that says "incorrect header check".

This is what I'm doing:

for key in my_lsm_db.keys():
    print key, zlib.decompress(my_lsm_db[key], zlib.MAX_WBITS|32).decode('utf-8')

After processing a few keys, the code throws an error. The error that I'm receiving is: error: Error -3 while decompressing data: incorrect header check

I want to remove all such error-generating documents from the corpus. How can I identify the documents that generate the error, so that I can remove them?

def remove_docs(my_lsm_db):
    for key in my_lsm_db.keys():
        ## write code that identifies an error when generated
        if <code that identifies document generating error>:
            del my_lsm_db[key]

Here's some information on the Zlib and MAX_WBITS part of the code: Zlib Compression, Stack Overflow Answer for Zlib Automatic Header Detection

Answer 1

I wrapped my code in a try/except block to get past the documents that generate the error. This works not just for the code above, but for other cases as well.

try:
    <code to execute>
except (<list of errors>) as e:
    print e
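
Applied to the question, the try/except pattern above could look roughly like the sketch below. It is written for Python 3 and makes a few assumptions that are not in the original post: the helper name find_bad_keys is made up for illustration, the database is assumed to behave like a mapping (keys(), db[key], del db[key]), and the exceptions caught are zlib.error for bad compressed data and UnicodeDecodeError for bytes that are not valid UTF-8. Bad keys are collected first and deleted afterwards so the database is not modified while it is being iterated.

```python
import zlib


def find_bad_keys(db):
    """Return the keys whose values fail to decompress or decode.

    `db` is assumed to be mapping-like (hypothetical stand-in for the lsm-db).
    """
    bad_keys = []
    for key in db.keys():
        try:
            # Same call as in the question: auto-detect zlib/gzip header.
            zlib.decompress(db[key], zlib.MAX_WBITS | 32).decode('utf-8')
        except (zlib.error, UnicodeDecodeError):
            # zlib.error covers "incorrect header check" and similar failures.
            bad_keys.append(key)
    return bad_keys


def remove_docs(db):
    """Delete every entry that cannot be decompressed and decoded."""
    bad_keys = find_bad_keys(db)
    # Delete only after the scan, so iteration is never disturbed.
    for key in bad_keys:
        del db[key]
    return bad_keys
```

Calling remove_docs(my_lsm_db) would then return the list of removed keys, which is handy for logging what was dropped before trusting the cleaned corpus.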

Answer 1 (accepted; 2017-04-28 15:39:27)
