Splitting a WARC file into chunks based on the header: WARC/1.0 (Python)

Question

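I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary.

Each chunk should start with the WARC/1.0 header, and the chunks are separated by 3 blank lines. I also want to drop the first two paragraphs: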

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz

isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

# Keep everything from here on:

WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372

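I tried grouping the chunks with a generator, but it returned a single group (the whole file). Is there a simple way to separate them?

I can't import any libraries.

Any help would be much appreciated!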

Answer 1

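By far the best way to do this is to use the warcio library, which knows how to correctly parse WARC files into records.

Failing that, I would copy the warcio code into your own code (its license is permissive).

WARC files are complicated, and using a well-tested, widely used library is the right way to parse them.

If you're downloading data from Common Crawl, I'd also suggest looking at my Python package cdx_toolkit. It uses warcio under the hood and handles the download step.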
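If importing or vendoring a library really is not an option, the split described in the question can be roughly sketched with built-ins only. This is a minimal sketch and not a real WARC parser: the function names `record_type` and `split_warc_records` are made up for illustration, the input is assumed to be already-decompressed text, and it assumes a line consisting only of `WARC/1.0` never appears inside a record's payload. A spec-following parser reads each record by its Content-Length instead, which is why a library is the safer choice.

```python
def record_type(lines):
    """Return the WARC-Type value from a record's header block, or None."""
    for line in lines[1:]:
        if not line.strip():
            break  # the first blank line ends the header block
        name, _, value = line.partition(":")
        if name.strip().lower() == "warc-type":
            return value.strip()
    return None


def split_warc_records(text):
    """Split WARC text into {index: record_text}, skipping warcinfo records.

    Assumption: every record starts with a line that is exactly 'WARC/1.0'
    and that line never occurs inside a record body.
    """
    records = []
    current = None  # lines of the record currently being collected
    for line in text.splitlines():
        if line.strip() == "WARC/1.0":
            if current is not None:
                records.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        records.append(current)

    chunks = {}
    for lines in records:
        # The warcinfo record covers the "first two paragraphs" in the
        # question: its header block plus its metadata payload.
        if record_type(lines) == "warcinfo":
            continue
        chunks[len(chunks)] = "\n".join(lines).strip()
    return chunks


# Tiny hand-written sample, not real crawl data.
sample = (
    "WARC/1.0\nWARC-Type: warcinfo\n\nisPartOf: CC-MAIN-2020-34\n\n\n\n"
    "WARC/1.0\nWARC-Type: request\n\nGET / HTTP/1.1\n"
)
for index, chunk in split_warc_records(sample).items():
    print(index, chunk.splitlines()[1])
```

For a real file, the text could be read with the built-in `open()` before calling `split_warc_records`; a `.warc.gz` download would need to be decompressed first.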
