
Splitting a WARC file into chunks based on the header: WARC/1.0 Python

I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary.

Each chunk should start with the WARC/1.0 header, and the chunks are separated by 3 empty lines. I would also like to remove the first 2 paragraphs:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz

isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

#Keep everything from here down:

WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372

I've tried using a generator to group the chunks, but it returns a single group (the whole file). Is there a simple way to separate these?

I can't import any libraries.

Any help would be greatly appreciated!

By far the best way to do this task is to use the warcio library, which knows how to properly parse WARC files into records.
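
For illustration, a minimal sketch of reading records with warcio into a dictionary might look like the following. The function name `load_records` and the choice to key the dictionary by WARC-Record-ID are assumptions made for this example, not something from the question; the sketch also assumes warcio is installed.

```python
def load_records(path):
    """Return {WARC-Record-ID: (record type, target URI, payload bytes)}.

    Hypothetical helper for illustration; warcinfo records are skipped,
    matching the request to drop the leading warcinfo block.
    """
    # Imported inside the function so the sketch can be loaded even where
    # warcio is not installed; normally this import would sit at the top.
    from warcio.archiveiterator import ArchiveIterator

    records = {}
    with open(path, "rb") as stream:
        # ArchiveIterator yields one parsed record at a time and
        # handles gzip-compressed input.
        for record in ArchiveIterator(stream):
            if record.rec_type == "warcinfo":
                continue
            record_id = record.rec_headers.get_header("WARC-Record-ID")
            uri = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            records[record_id] = (record.rec_type, uri, payload)
    return records
```

Calling `load_records` with the path to a `.warc.gz` file would return one entry per non-warcinfo record.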

Barring that, I would copy the warcio code into yours (the license is permissive).

WARC files are complicated, and using a fully tested and widely used library is the right way to parse them.
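
If importing anything really is off the table, a rough import-free sketch for the simple case in the question could start a new chunk at every line that is exactly `WARC/1.0`. This assumes the file has already been decompressed and read as text. The names `split_warc_text` and `chunks_to_dict` are hypothetical. The sketch ignores Content-Length, so a payload that happens to contain a line reading `WARC/1.0` would be split incorrectly, which is one of the reasons a real parser is preferable.

```python
def split_warc_text(text, marker="WARC/1.0"):
    """Split decompressed WARC text into chunks, one per marker line."""
    chunks = []
    current = None
    for line in text.splitlines():
        if line.strip() == marker:
            # A marker line closes the previous chunk and opens a new one.
            if current is not None:
                chunks.append(current)
            current = [marker]
        elif current is not None:
            current.append(line)
    if current is not None:
        chunks.append(current)
    # Join the lines back together and drop the blank separator lines
    # at the end of each chunk.
    return ["\n".join(lines).strip() for lines in chunks]


def chunks_to_dict(chunks, skip_types=("warcinfo",)):
    """Store chunks in a dict keyed by WARC-Record-ID, skipping some types."""
    result = {}
    for index, chunk in enumerate(chunks):
        headers = {}
        # Header lines run from just after the marker to the first blank line.
        for line in chunk.splitlines()[1:]:
            if not line.strip():
                break
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        if headers.get("WARC-Type") in skip_types:
            continue
        # Fall back to the chunk's position if the record has no ID header.
        result[headers.get("WARC-Record-ID", index)] = chunk
    return result
```

With the file contents in a string `text`, `chunks_to_dict(split_warc_text(text))` gives one dictionary entry per record, with the leading warcinfo record removed.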

If you're downloading data from Common Crawl, I would also recommend checking out my Python package cdx_toolkit. It uses warcio under the hood and handles the downloading steps.
