
Cannot find URL from a WARC file crawled from Common Crawl

I have crawled data from Common Crawl and I want to find the URL corresponding to each of the records.

for record in files:
    print(record['WARC-Target-URI'])

This outputs an empty list. I am following this post: https://dmorgan.info/posts/common-crawl-python/ . Do we get a target URI for each record, or just one target URI per WARC file path?

The info you're after is part of each record's header. Try:

print(record.header['WARC-Target-URI'])
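
To see why the URI is a per-record value rather than a per-file one, here is a minimal standard-library sketch that reads the header block of each record in an uncompressed WARC stream. It is not the library used in the question: `iter_warc_headers` and `make_record` are hypothetical helpers, and the sample records are made up for illustration.

```python
import io


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so the next read starts at the next record.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def make_record(record_type, body, target_uri=None):
    """Build one synthetic WARC record (hypothetical helper, demo data only)."""
    lines = ["WARC/1.0", "WARC-Type: " + record_type]
    if target_uri is not None:
        lines.append("WARC-Target-URI: " + target_uri)
    lines.append("Content-Length: %d" % len(body))
    head = "\r\n".join(lines).encode("utf-8")
    return head + b"\r\n\r\n" + body + b"\r\n\r\n"


# Made-up sample: one warcinfo record followed by two records with URIs.
sample = (
    make_record("warcinfo", b"software: example\r\n")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello", "http://example.com/")
    + make_record("request", b"GET /page HTTP/1.1\r\n\r\n", "http://example.org/page")
)

uris = []
for headers in iter_warc_headers(io.BytesIO(sample)):
    # .get() returns None for records without a target URI, such as warcinfo.
    uris.append(headers.get("WARC-Target-URI"))
    print(headers["WARC-Type"], headers.get("WARC-Target-URI"))
```

Each record carries its own `WARC-Target-URI`, except record types such as `warcinfo` that describe the file itself, so using `.get()` or checking the record type avoids a `KeyError` on those. For real `.warc.gz` files from Common Crawl, a maintained reader such as warcio is likely a more robust choice than a hand-rolled parser like this one.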
