Reading a WARC file with Python

Question

I want to read a WARC file, and I wrote the following code based on this page, but nothing was printed!

>>> import warc
>>> f = warc.open("01.warc.gz")
>>> for record in f:
...     print record['WARC-Target-URI'], record['Content-Length']

However, when I ran the following command, I got a result:

>>> print f
<warc.warc.WARCFile instance at 0x0000000002C7DE88>

Note that my WARC file is one of the files from the ClueWeb09 dataset. I mention it because of this page.

Answer 1

I had the same problem as you.

After some research on the module, I found a solution.

Try using record.payload.read(). Here is a full example:

import warc
f = warc.open("01.warc.gz")
for record in f:
    # payload is read like a file; read() returns the record body
    print(record.payload.read())

Also, you can read not only WARC files but WET files too. A small trick is to rename the file so that its name contains .warc.

Kind regards

Answer 2

First of all, WARC (Web ARChive) is an archival format for web pages. Reading a WARC file is a bit tricky because each record starts with a special header. The following assumes your WARC file is of this format.
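To make the record header concrete, here is a minimal Python 3 sketch of how a single record can be split into its version line, header fields, and body. The sample record is made up for illustration (the URI and body text are hypothetical), and `split_record` is a name introduced here, not part of any library. Real files carry more header fields.

```python
# Hypothetical sample record; the URI and body are made up for illustration.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 11\r\n"
    "\r\n"
    "hello world"
)


def split_record(text):
    """Split one record into (version line, header dict, body)."""
    # The header block ends at the first blank line.
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only, since values such as URIs contain colons.
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    return version, headers, body


version, headers, body = split_record(SAMPLE_RECORD)
print(version)
print(headers["WARC-Target-URI"])
print(body)
```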

You can use the following code to load and parse the file, yielding a dictionary with the metadata and the content for every record.

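def read_header(file_handler):
    """Read 'Key: value' header lines up to the first blank line."""
    header = {}
    line = next(file_handler, '')
    # Stop at the blank line that ends the header, or at end of file.
    while line.strip():
        key, value = line.split(': ', 1)
        header[key] = value.rstrip()
        line = next(file_handler, '')
    return header


def warc_records(path):
    """Yield one dictionary per record that has a WARC-Refers-To header."""
    with open(path) as fh:
        # A for loop ends cleanly at end of file instead of raising StopIteration.
        for line in fh:
            if line.rstrip() == 'WARC/1.0':
                output = read_header(fh)
                if 'WARC-Refers-To' not in output:
                    continue
                # Only the first line of the record body is kept.
                output["Content"] = next(fh, '')
                yield output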

You can access the dictionary as follows:

>>> records = warc_records("<some path>")
>>> next_record = next(records)
>>> sorted(next_record.keys())
['Content', 'Content-Length', 'Content-Type', 'WARC-Block-Digest', 'WARC-Date', 'WARC-Record-ID', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Warcinfo-ID']
>>> next_record['WARC-Date']
'2013-06-20T00:32:15Z'
>>> next_record['WARC-Target-URI']
'http://09231204.tumblr.com/post/44534196170/high-res-new-photos-of-the-cast-of-neilhimself'
>>> next_record['Content'][:30]
'Side Effects high res. New pho'
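The session above keeps only the first line of each record body. As a rough Python 3 sketch of one way to get the whole body, the Content-Length header (the size of the record body in bytes) can be used to read exactly that many bytes from a binary stream, which also works on a gzip-compressed file. The names `iter_warc_records` and `make_record`, and the example URIs, are assumptions introduced for this illustration. Real-world files may deviate from this layout, in which case the sketch would need adjusting.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of file) ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        # Content-Length counts bytes, so read from the binary stream.
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body


def make_record(uri, body):
    """Build a small record for the demo (hypothetical data)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        "WARC-Target-URI: {}\r\n"
        "Content-Length: {}\r\n"
        "\r\n"
    ).format(uri, len(body)).encode("utf-8")
    return header + body + b"\r\n\r\n"


# Build a tiny gzip-compressed archive in memory instead of touching disk.
raw = (make_record("http://example.com/a", b"first body\nsecond line")
       + make_record("http://example.com/b", b"another body"))
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rb") as stream:
    records = list(iter_warc_records(stream))

for headers, body in records:
    print(headers["WARC-Target-URI"], len(body))
```

For a file on disk, the same generator could be fed from `gzip.open(path, "rb")`, or from `open(path, "rb")` for an uncompressed file.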

Answer 1
2 2017-03-16 16:14:14

Answer 2
0 2018-01-21 13:25:06
