Python cannot read “warc.gz” file completely

For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.

I noticed that I cannot read the majority of these files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.

After some research I found that this problem occurs only with the gzipped files. Files with the extension "warc" do not have this problem.

Other people have reported the same problem ( https://github.com/internetarchive/warc/issues/21 ), but no solution has been found.

I guess that there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?

Thanks in advance!

Example:

I create new warc.gz files like this:

import warc
warc_path = r"\\some_path\file_name.warc.gz"  # raw string, so "\f" is not read as a form feed
warc_file = warc.open(warc_path, "wb")

To write records I use:

record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)

This creates perfect "warc.gz" files. There are no problems with them. Everything, including the "\r\n" line endings, is correct. But the problem starts when I read these files.

To read files I use:

warc_file = warc.open(warc_path, "rb")

To loop through records I use:

for record in warc_file:
    ...

The problem is that not all records are found during this loop for "warc.gz" files, while they are all found for "warc" files. Working with both types of files is handled by the warc library itself.

It seems that the custom gzip handling in warc.gzip2.GzipFile, the file splitting with warc.utils.FilePart, and the reading in warc.warc.WARCReader are broken as a whole (tested with Python 2.7.9, 2.7.10 and 2.7.11). The reader stops short when it receives no data instead of a new header.

It would seem that the basic stdlib gzip handles the concatenated files just fine, so this should work as well:

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
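
To check the claim about concatenated gzip members without involving the warc library at all, here is a minimal sketch using only the standard library. The helper names (make_multi_member_gzip, count_version_lines) and the three tiny record-like chunks are made up for illustration, and the chunks are not complete, valid WARC records. Counting lines that start with "WARC/" is only a rough estimate of the record count, since a payload could contain such a line too.

```python
import gzip
import io


def make_multi_member_gzip(chunks):
    """Compress each chunk as its own gzip member and concatenate the members."""
    out = io.BytesIO()
    for chunk in chunks:
        # Each GzipFile context writes one complete member (header, data, trailer)
        # and leaves the underlying buffer open for the next member.
        with gzip.GzipFile(fileobj=out, mode="wb") as member:
            member.write(chunk)
    return out.getvalue()


def count_version_lines(raw_gz_bytes):
    """Count lines starting with b'WARC/' in the decompressed stream (rough record count)."""
    count = 0
    with gzip.GzipFile(fileobj=io.BytesIO(raw_gz_bytes), mode="rb") as gzf:
        for line in gzf:
            if line.startswith(b"WARC/"):
                count += 1
    return count


# Hypothetical, minimal record-like chunks; real records carry more headers.
chunks = [
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nbody%d\r\n\r\n" % i
    for i in range(3)
]

data = make_multi_member_gzip(chunks)
record_count = count_version_lines(data)

# Reading the whole stream back should yield every member's data, in order.
with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as gzf:
    roundtrip = gzf.read()
```

Running the same kind of count on a real "warc.gz" file (with gzip.open on the path instead of the in-memory buffer) gives a number to compare against what the warc library iterates over.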
