
wget recursive fails on wiki pages

I'm trying to recursively fetch all pages linked from a Moin wiki page. I've tried many different wget recursive options, which all have the same result: only the html file from the given URL gets downloaded, not any of the pages linked from that html page.

If I use the --convert-links option, wget correctly translates the unfetched links to the right web links. It just doesn't recursively download those linked pages.

wget --verbose -r https://wiki.gnome.org/Outreachy
--2017-03-02 10:34:03--  https://wiki.gnome.org/Outreachy
Resolving wiki.gnome.org (wiki.gnome.org)... 209.132.180.180, 209.132.180.168
Connecting to wiki.gnome.org (wiki.gnome.org)|209.132.180.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wiki.gnome.org/Outreachy’

wiki.gnome.org/Outreachy                                      [  <=>                                                                                                                                ]  52.80K   170KB/s    in 0.3s    

2017-03-02 10:34:05 (170 KB/s) - ‘wiki.gnome.org/Outreachy’ saved [54064]

FINISHED --2017-03-02 10:34:05--
Total wall clock time: 1.4s
Downloaded: 1 files, 53K in 0.3s (170 KB/s)

I'm not sure if it's failing because the wiki's html links don't end with .html. I've tried using various combinations of --accept='[a-zA-Z0-9]+', --page-requisites, and --accept-regex='[a-zA-Z0-9]+' to work around that, no luck.
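One such combination looked something like this (not the exact command, just the kind of invocation I mean):

wget -r --page-requisites --accept-regex='[a-zA-Z0-9]+' https://wiki.gnome.org/Outreachy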

I'm not sure if it's failing because the wiki has html pages like https://wiki.gnome.org/Outreachy that link to page URLs like https://wiki.gnome.org/Outreachy/Admin and https://wiki.gnome.org/Outreachy/Admin/GettingStarted. Maybe wget is confused because there will need to be an HTML page and a directory with the same name? I also tried using -nd but no luck.

The linked html pages are all relative to the base wiki URL (eg <a href="/Outreachy/History">Outreachy history page</a>). I've also tried adding --base="https://wiki.gnome.org/" with no luck.
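That attempt looked something like this:

wget -r --base="https://wiki.gnome.org/" https://wiki.gnome.org/Outreachy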

At this point, I've tried a whole lot of different wget options, read several Stack Overflow and unix.stackexchange.com questions, and nothing I've tried has worked. I'm hoping there's a wget expert who can look at this particular wiki page and figure out why wget is failing to recursively fetch linked pages. The same options work fine on other domains.

I've also tried httrack, with the same result. I'm running Linux, so please don't suggest Windows or proprietary tools.

This seems to be caused by the following tag in the wiki:

<meta name="robots" content="index,nofollow">

wget honors this tag by default as part of its robot exclusion support, which is why it saves the page itself but refuses to follow any of its links. If you are sure you want to ignore the tag, you can make wget ignore it using -e robots=off:

wget -e robots=off --verbose -r https://wiki.gnome.org/Outreachy
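If you also want the link rewriting mentioned in the question, something along these lines should work (untested against this particular wiki):

wget -e robots=off -r --convert-links https://wiki.gnome.org/Outreachy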
