Getting rid of error-generating documents from a corpus

I have a set of 1000 documents, encoded and compressed, stored in an lsm-db on my computer. When I try to decompress and decode them, I get an error that says "incorrect header check".

This is what I'm doing:

for key in my_lsm_db.keys():
    print key, zlib.decompress(my_lsm_db[key], zlib.MAX_WBITS|32).decode('utf-8')

After processing a few keys, the code throws an error. The error that I'm receiving is:

error: Error -3 while decompressing data: incorrect header check

I want to remove all such error-generating documents from the corpus. How can I identify the documents that generate the error, so that I can remove them?

def remove_docs(my_lsm_db):
    for key in my_lsm_db.keys():
        ## write code that identifies an error when generated
        if <code that identifies document generating error>:
            del my_lsm_db[key]



Here's some information on zlib and the MAX_WBITS part of the code: Zlib Compression, and the Stack Overflow answer on zlib automatic header detection.
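To make the MAX_WBITS part concrete, here is a small sketch (assuming Python 3) of what `zlib.MAX_WBITS | 32` does: it asks zlib to detect a zlib or gzip header automatically. The sample payload and the helper name `header_error` are made up for illustration. Bytes that start with neither kind of header are the sort of value that raises `zlib.error`.

```python
import gzip
import zlib

# Hypothetical sample payload, not taken from the real corpus.
payload = "sample document".encode("utf-8")

zlib_blob = zlib.compress(payload)   # data wrapped in a zlib header
gzip_blob = gzip.compress(payload)   # data wrapped in a gzip header

# MAX_WBITS | 32 enables automatic zlib/gzip header detection.
AUTO_HEADER = zlib.MAX_WBITS | 32

zlib_ok = zlib.decompress(zlib_blob, AUTO_HEADER) == payload
gzip_ok = zlib.decompress(gzip_blob, AUTO_HEADER) == payload


def header_error(blob):
    """Return the zlib error message for blob, or None if it decompresses."""
    try:
        zlib.decompress(blob, AUTO_HEADER)
    except zlib.error as exc:
        return str(exc)
    return None


# Bytes that were never compressed carry no valid header.
message = header_error(b"not compressed")
```

Here `message` holds the text of the error raised for the uncompressed bytes, while both compressed blobs round-trip cleanly.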

I tried using a try/except block around my code to get past such error-generating documents. It works not just for the code above, but for other code as well.

try:
    <code to execute>
except (<list of errors>) as e:
    print e
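Applied to the question's `remove_docs`, the try/except idea might look like the sketch below. It assumes Python 3 and that the database supports `keys()`, indexing, and `del`, as the question's code implies. A plain dict stands in for the lsm-db here, and the helper name `find_bad_keys` and the sample entries are hypothetical. Bad keys are collected first and deleted afterwards, as a precaution against modifying the store while iterating over it.

```python
import zlib

AUTO_HEADER = zlib.MAX_WBITS | 32  # same wbits value as in the question


def find_bad_keys(db):
    """Return the keys whose values fail to decompress or decode."""
    bad = []
    for key in list(db.keys()):
        try:
            zlib.decompress(db[key], AUTO_HEADER).decode("utf-8")
        except (zlib.error, UnicodeDecodeError):
            bad.append(key)
    return bad


def remove_docs(db):
    """Delete every entry that cannot be read; return the removed keys."""
    bad = find_bad_keys(db)
    for key in bad:
        del db[key]
    return bad


# Stand-in for the lsm-db: a plain dict with made-up entries.
corpus = {
    b"doc1": zlib.compress("first document".encode("utf-8")),
    b"doc2": b"not compressed",           # fails in zlib.decompress
    b"doc3": zlib.compress(b"\xff\xfe"),  # decompresses, but is not UTF-8
}

removed = remove_docs(corpus)
```

Catching only `zlib.error` and `UnicodeDecodeError` keeps unrelated failures visible. Whether undecodable documents should be removed along with undecompressible ones is a choice made for this sketch, not something the question specifies.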
