
wget --warc-file gets only main page and robot pages?

I am trying to do a small project on a small-ish WARC file. I used this command:

[ ! -f course.warc.gz ] && wget -r -l 3 "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz already exists

The first time I ran it, everything went fine: I got over 150 pages' worth, amazing. Now I wanted to redo it from scratch, so I deleted the file 'course.warc.gz'. The problem is that when I run the same command now, I only get 3 pages: the one requested, plus two robots pages. Why is this happening?

Wget can follow links in HTML, [...] This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). (wget manual)

The robots.txt includes the following rule:

# Block alle andere spiders
User-agent: *
Disallow: /
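
As a quick check, the robots.txt the site currently serves can be fetched directly; this is a sketch assuming the same host as in the question:

wget -qO- https://www.ru.nl/robots.txt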

It is difficult to say what happened during the previous run of wget. Maybe the robots.txt changed in the meantime?
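
If you want to re-crawl the site anyway, wget can be told to ignore robots.txt via -e robots=off (which applies the robots = off setting as if it were in .wgetrc). The following is a sketch based on the command from the question, not a recommendation; only use it where crawling is permitted:

[ ! -f course.warc.gz ] && wget -r -l 3 -e robots=off "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz already exists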
