Python: How to split a WARC file?
My goal is to split and sort a WARC file from CommonCrawl into its individual records. Example file:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz
isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372
How can I split the file into separate records at each "WARC/1.0" line?
You can do it with the "warcio" library.
Example code:
import requests
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def split_records(url):
    # Stream the archive instead of downloading it into memory first
    resp = requests.get(url, stream=True)
    for record in ArchiveIterator(resp.raw, arc2warc=True):
        if record.rec_type == 'warcinfo':
            # Print the metadata block of the warcinfo record
            print(record.raw_stream.read())
        elif record.rec_type == 'response':
            # Use the UUID part of WARC-Record-ID as the output file name
            record_id = record.rec_headers.get_header('WARC-Record-ID').rsplit(':', 1)[-1].rstrip('>')
            print(record_id)
            with open('%s.warc.gz' % record_id, 'wb') as output:
                writer = WARCWriter(output, gzip=True)
                writer.write_record(record)
        # Other record types (request, metadata, ...) are skipped

split_records('https://cdn.ruarxive.org/public/webcollect2020/kgi/komitetgi.ru/komitetgi.ru.warc.gz')
It prints the warcinfo block and writes each response record of the WARC file to its own .warc.gz file, named after the record's UUID, in the current working directory (the script's directory if you run it from there).
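If you would rather not depend on warcio, here is a rough standard-library sketch of the same idea. It does not split at the text "WARC/1.0" (a payload could contain that line); instead it reads each record's headers and then exactly Content-Length bytes of block. The names iter_warc_records and split_warc are made up for this example, the input is assumed to be a local .warc.gz file, and folded header lines are not handled, so treat it as an illustration rather than a full parser.

```python
import gzip
import os


def iter_warc_records(stream):
    """Yield (headers, raw_record) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        head = [line]
        headers = {}
        for raw in iter(stream.readline, b""):
            head.append(raw)
            if not raw.strip():
                break  # a blank line ends the header section
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record block in bytes
        block = stream.read(int(headers["content-length"]))
        yield headers, b"".join(head) + block


def split_warc(path, out_dir="."):
    """Write every record of a local .warc.gz file to its own uncompressed .warc file."""
    names = []
    with gzip.open(path, "rb") as stream:
        for headers, record in iter_warc_records(stream):
            # Same naming scheme as above: the UUID part of WARC-Record-ID
            uuid = headers["warc-record-id"].rsplit(":", 1)[-1].rstrip(">")
            name = os.path.join(out_dir, uuid + ".warc")
            with open(name, "wb") as out:
                out.write(record + b"\r\n\r\n")
            names.append(name)
    return names
```

Calling split_warc('CC-MAIN-20200804014340-20200804044340-00045.warc.gz') on a downloaded copy of the file would write one .warc file per record, including the warcinfo and request records that the warcio example skips.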