Unable to find URLs in WARC files from Common Crawl

Question

I have downloaded data from Common Crawl and I want to find the URL corresponding to each record.

for record in files:
    print(record['WARC-Target-URI'])

This prints an empty list. I am following https://dmorgan.info/posts/common-crawl-python/ . Is there a target URI for each record, or only one target URI per WARC file path?

Answer 1

The information you're after is part of the record header. Try:

for record in files:
    print(record.header['WARC-Target-URI'])
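
On the second part of the question: in the WARC format every record carries its own header block, so the target URI is per record, not per file. The sketch below illustrates this with a hand-rolled reader that uses only the standard library. It is not the library used above; the names iter_warc_headers and make_record are hypothetical helpers, and the sample records are made up for illustration.

```python
import io


def iter_warc_headers(stream):
    """Yield one dict of WARC headers per record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the payload; its size is given by Content-Length.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def make_record(rec_type, body, target_uri=None):
    """Build a tiny WARC record for the demo (hypothetical helper)."""
    lines = ["WARC/1.0", "WARC-Type: " + rec_type]
    if target_uri:
        lines.append("WARC-Target-URI: " + target_uri)
    lines.append("Content-Length: %d" % len(body))
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("utf-8") + body + b"\r\n\r\n"


# Made-up sample: one warcinfo record followed by two response records.
sample = (
    make_record("warcinfo", b"software: demo")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello", "http://example.com/")
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nbye", "http://example.org/page")
)

urls = [h.get("WARC-Target-URI") for h in iter_warc_headers(io.BytesIO(sample))]
print(urls)
# [None, 'http://example.com/', 'http://example.org/page']
```

The first entry is None because a warcinfo record has no WARC-Target-URI, so using .get() (or skipping records whose type is not "response") avoids a lookup error on the first record of a file. For a real .warc.gz file, passing gzip.open(path, "rb") as the stream should work the same way, and a maintained library such as warcio exposes the same field through record.rec_headers.get_header('WARC-Target-URI').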

Answer 1, 2017-07-18 12:37:26
