Getting rid of error-generating documents from a corpus
I have a set of 1000 documents, encoded and compressed, stored in an lsm-db on my computer. When I try to decompress and decode them, I get an error that says "incorrect header check".
This is what I'm doing:
import zlib

for key in my_lsm_db.keys():
    print(key, zlib.decompress(my_lsm_db[key], zlib.MAX_WBITS | 32).decode('utf-8'))
After processing a few keys, the code throws an error. The error that I'm receiving is:
error: Error -3 while decompressing data: incorrect header check
I want to remove all such error-generating documents from the corpus.
How can I identify the documents that generate the error, so that I can remove them?
def remove_docs(my_lsm_db):
    for key in my_lsm_db.keys():
        ## write code that identifies an error when generated
        if <code that identifies document generating error>:
            del my_lsm_db[key]
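One way to fill in the placeholder above is to attempt the decompress-and-decode step for each key and treat a raised exception as the signal that the document is bad. The following is a minimal sketch, not a definitive implementation. It assumes the database object behaves like a mapping (supports `keys()`, item lookup, and `del`); the helper name `find_bad_keys` and the `demo_db` dictionary are hypothetical and stand in for the real lsm-db. Keys are collected first and deleted afterwards, so the store is not modified while it is being iterated.

```python
import zlib

AUTO_DETECT = zlib.MAX_WBITS | 32  # accept either a zlib or a gzip header


def find_bad_keys(db):
    """Return the keys whose values fail to decompress or decode."""
    bad_keys = []
    for key in list(db.keys()):
        try:
            zlib.decompress(db[key], AUTO_DETECT).decode('utf-8')
        except (zlib.error, UnicodeDecodeError):
            # zlib.error covers "incorrect header check";
            # UnicodeDecodeError covers payloads that are not valid UTF-8.
            bad_keys.append(key)
    return bad_keys


def remove_docs(db):
    """Delete every document that cannot be decompressed and decoded."""
    bad_keys = find_bad_keys(db)
    for key in bad_keys:
        del db[key]
    return bad_keys


# Hypothetical stand-in for the real database: a plain dict.
demo_db = {
    b'good': zlib.compress('hello'.encode('utf-8')),
    b'not_compressed': b'this was never compressed',
    b'bad_utf8': zlib.compress(b'\xff\xfe\xfd'),
}
removed = remove_docs(demo_db)
```

Returning the list of removed keys makes it possible to log or inspect them before deciding whether the deletion should be permanent.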
Here's some information on the Zlib and MAX_WBITS part of the code: Zlib Compression, Stack Overflow Answer for Zlib Automatic Header Detection
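To make the role of `zlib.MAX_WBITS | 32` concrete, here is a small sketch showing which inputs that setting accepts. Adding 32 to the window size asks zlib to auto-detect a zlib or gzip header; data with neither header (raw deflate, or bytes that were never compressed) is rejected with a `zlib.error`. The payload and the helper name `try_decode` are made up for illustration.

```python
import gzip
import zlib

payload = b"hello corpus"

zlib_blob = zlib.compress(payload)   # zlib header and trailer
gzip_blob = gzip.compress(payload)   # gzip header and trailer

# A negative wbits value produces a raw deflate stream with no header.
raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_blob = raw_compressor.compress(payload) + raw_compressor.flush()

AUTO_DETECT = zlib.MAX_WBITS | 32  # auto-detect zlib or gzip headers


def try_decode(blob):
    """Return the decompressed bytes, or the zlib.error that was raised."""
    try:
        return zlib.decompress(blob, AUTO_DETECT)
    except zlib.error as exc:
        return exc


zlib_result = try_decode(zlib_blob)
gzip_result = try_decode(gzip_blob)
raw_result = try_decode(raw_blob)
plain_result = try_decode(b"plain text, never compressed")

# Raw deflate can still be read when the header is explicitly disabled.
raw_ok = zlib.decompress(raw_blob, -zlib.MAX_WBITS)
```

If some documents in the corpus were stored raw or uncompressed, a check like this can help tell them apart from documents that are truly corrupted.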
I tried using a try/except block around my code to skip such error-generating documents. It works not just for the above code, but for other things as well.
try:
    <code to execute>
except (<list of errors>) as e:
    print(e)