
Number of records in WARC file

I am currently parsing WARC files from the Common Crawl corpus, and I would like to know upfront, without iterating through all WARC records, how many records a file contains.

Does the WARC 1.1 standard define such information?

The WARC standard does not define a way to indicate the number of WARC records in the WARC file itself. The number of response records in Common Crawl WARC files is usually between 30,000 and 50,000; note that there are also request and metadata records. The WARC standard recommends 1 GB as the target size of WARC files, which puts a natural limit on the number of records.
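Since the file carries no record count, one pass over it is still needed, but that pass can be kept cheap by reading only each record's header and skipping the payload using its Content-Length. The following is a minimal sketch using only the Python standard library, not a full WARC parser: the helper names `count_warc_records` and `make_record` are made up for illustration, and the tiny in-memory sample merely imitates the per-record gzip layout so the snippet runs without downloading anything.

```python
import gzip
import io
from collections import Counter


def count_warc_records(stream):
    """Count records per WARC-Type, reading headers and skipping payloads."""
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line[:40]!r}")
        rec_type, length = "unknown", 0
        # Read header fields up to the empty line that ends the header.
        while True:
            header = stream.readline()
            if header in (b"\r\n", b"\n", b""):
                break
            name, _, value = header.partition(b":")
            name = name.strip().lower()
            if name == b"warc-type":
                rec_type = value.strip().decode("ascii", "replace")
            elif name == b"content-length":
                length = int(value.strip())
        # Skip the content block without parsing it.
        remaining = length
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                raise ValueError("truncated record")
            remaining -= len(chunk)
        counts[rec_type] += 1
    return counts


def make_record(rec_type, payload):
    """Build one gzip-compressed sample record (hypothetical test data)."""
    head = b"WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        rec_type.encode("ascii"),
        len(payload),
    )
    return gzip.compress(head + payload + b"\r\n\r\n")


# Three records, each its own gzip member, concatenated into one stream.
sample = (
    make_record("request", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("metadata", b"fetchTimeMs: 12\r\n")
)
counts = count_warc_records(gzip.GzipFile(fileobj=io.BytesIO(sample)))
print(dict(counts), "total:", sum(counts.values()))
```

For a real file, the same function could be called on the object returned by `gzip.open(path, "rb")`, where `path` points at a downloaded WARC file. The whole file is still decompressed, so this only saves the cost of parsing payloads, not the cost of reading the file.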


 