
Read warc file with python

I want to read a WARC file, and I wrote the following code based on this page, but nothing was printed:

>>> import warc
>>> f = warc.open("01.warc.gz")
>>> for record in f:
...     print record['WARC-Target-URI'], record['Content-Length']

However, when I ran the following command, I got a result:

>>> print f
<warc.warc.WARCFile instance at 0x0000000002C7DE88>

Note that my WARC file is one of the files from the ClueWeb09 dataset. I mention it because of this page.

I had the same problem as you.

After some research on the module, I found a solution.

Try using record.payload.read(). Here is a full example:

import warc
f = warc.open("01.warc.gz")
for record in f:
  print record.payload.read()

Also, you can read not only WARC files this way, but WET files too. A small trick is to rename the file so that its name contains .warc.

Kind regards

First of all, WARC, or Web ARChive, is an archival format for web pages. Reading a WARC file is a bit tricky because each record starts with a special header. The rest of this answer assumes your WARC file is of this format.
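For orientation, the layout of a single record can be sketched as follows: a version line, a block of `Key: value` header lines, a blank line, and then a payload whose size in bytes is given by `Content-Length`. This is only a minimal illustration. The sample record and the helper name `split_record` are hypothetical and were written by hand for this example; they are not taken from a real dataset.

```python
# A hand-written sample record (hypothetical values), laid out like a
# WET "conversion" record: version line, headers, blank line, payload.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)


def split_record(raw):
    """Split one raw record into (version, headers, payload)."""
    # The first blank line separates the header block from the payload.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    # Content-Length counts bytes, so slice the payload before decoding it.
    payload = rest[: int(headers["Content-Length"])]
    return version, headers, payload


version, headers, payload = split_record(SAMPLE)
print(version, headers["WARC-Target-URI"], payload)
```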

You can use the following code to load and parse the file and yield a dictionary for every record, containing the metadata and the content.

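def read_header(file_handler):
    """Read 'Key: value' lines up to the blank line that ends the header."""
    header = {}
    for line in file_handler:
        if line == '\n':  # a blank line terminates the header block
            break
        key, value = line.split(': ', 1)
        header[key] = value.rstrip()
    return header


def warc_records(path):
    """Yield one dict per record: header fields plus the first content line."""
    # Iterating with `for` ends cleanly at end of file; calling next() inside
    # a generator would raise RuntimeError there on Python 3.7+.
    with open(path, encoding='utf-8', errors='replace') as fh:
        for line in fh:
            if line == 'WARC/1.0\n':
                output = read_header(fh)
                # Skip records without WARC-Refers-To (e.g. the warcinfo record).
                if 'WARC-Refers-To' not in output:
                    continue
                # Only the first line of the payload is kept here.
                output["Content"] = next(fh, '')
                yield output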

You can access the dictionary as follows:

>>> records = warc_records("<some path>")
>>> next_record = next(records)
>>> sorted(next_record.keys())
['Content', 'Content-Length', 'Content-Type', 'WARC-Block-Digest', 'WARC-Date', 'WARC-Record-ID', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Warcinfo-ID']
>>> next_record['WARC-Date']
'2013-06-20T00:32:15Z'
>>> next_record['WARC-Target-URI']
'http://09231204.tumblr.com/post/44534196170/high-res-new-photos-of-the-cast-of-neilhimself'
>>> next_record['Content'][:30]
'Side Effects high res. New pho'
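Two details are worth noting: the file in the question is gzip-compressed (01.warc.gz), and a record's payload usually spans more than one line. A minimal sketch that handles both, using only the standard library, might look like the following. The helper names `open_binary` and `iter_records` are hypothetical. The sketch also assumes that every record carries a correct `Content-Length` header, and it accepts any version line starting with `WARC/` rather than only `WARC/1.0`, which is worth checking against your own file.

```python
import gzip
import io


def open_binary(path):
    """Open a .gz file through gzip, anything else as a plain binary file."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_records(stream):
    """Yield (headers, payload_bytes) for each record in a binary stream."""
    for line in stream:
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        for line in stream:
            if line in (b"\r\n", b"\n"):  # blank line: end of the header block
                break
            text = line.decode("utf-8", errors="replace")
            key, _, value = text.partition(":")
            headers[key.strip()] = value.strip()
        # Read exactly Content-Length bytes so multi-line payloads stay whole.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


# In-memory demonstration with a hand-written (hypothetical) record.
sample = io.BytesIO(
    b"WARC/1.0\r\nWARC-Type: conversion\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
)
for headers, payload in iter_records(sample):
    print(headers["WARC-Type"], payload)
```

For a file on disk, the same generator can be used as `with open_binary("01.warc.gz") as fh: ...` followed by a loop over `iter_records(fh)`. Payloads are returned as bytes, so decoding is left to the caller.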
