
WGet's Logic in Order of Downloading

This is a more general question, but one with wider implications for a data-mining project I'm running. I have been using wget to mirror archival webpages for analysis. This is a large amount of data, and my current mirroring process has been going on for almost a week, which has given me a lot of time to watch the readout.

How does wget determine the order in which it downloads pages? I can't seem to discern a consistent logic in its decision-making process (it isn't proceeding alphabetically, by date of original site creation, or by file type). As I begin to work with the data, this would be very helpful to grasp.

FWIW, here is the command I'm using (the site requires cookies, and while its TOS do allow access 'by any means', I don't want to take any chances), where SITE = URL:

wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE

Edited to add: in the comments on Chown's helpful answer, I refined my question a bit, so here it is. With larger sites (say epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp), I find that wget initially creates a directory structure and some of the index.html/default.html pages, but then goes back through the disparate websites a few more times, grabbing a few more images and sub-pages on each pass.

From the gnu.org wget manual, Recursive Download:

  • Recursive Download

GNU Wget is capable of traversing parts of the Web (or a single http or ftp server), following links and directory structure. We refer to this as to recursive retrieval, or recursion.

With http urls, Wget retrieves and parses the html or css from the given url, retrieving the files the document refers to, through markup like href or src, or css uri values specified using the 'url()' functional notation. If the freshly downloaded file is also of type text/html, application/xhtml+xml, or text/css, it will be parsed and followed further.
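Stepping outside the manual's wording for a moment, that parsing step can be approximated in a few lines of Python. This is only a rough illustration, not wget's implementation (wget's parser is written in C and also follows the css url() references mentioned above, which html.parser does not): it collects href/src values in the order they appear in the markup, which matches the download order seen in the test at the bottom of this post.

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects href/src attribute values in document order.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

parser = LinkExtractor()
parser.feed(open("index.html").read())
print(parser.links)  # for the test page below: ['style.css', '/c.htm', '/a.htm', '/b.htm']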

Recursive retrieval of http and html/css content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

The maximum depth to which the retrieval may descend is specified with the '-l' option. The default maximum depth is five layers.
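Putting those two paragraphs together, the documented traversal can be sketched as a FIFO queue with a depth cutoff. Again, this is only a sketch of the behaviour the manual describes, not wget's actual code (which also handles robots.txt, --no-parent, host-spanning rules, retries, and so on); it reuses the LinkExtractor class from the previous snippet.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def mirror(start_url, max_depth=5):      # 5 matches the -l default above
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()     # FIFO queue: everything at depth N
        body = urlopen(url).read().decode("utf-8", "replace")  # is fetched
        print(f"depth {depth}: {url}")   # before anything at depth N+1
        if depth >= max_depth:
            continue
        parser = LinkExtractor()         # from the previous snippet
        parser.feed(body)
        for link in parser.links:        # children enqueued in document order
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

This queue is also what would produce the "several passes" effect described in the edit above: the directory skeleton and index pages are depth 1, and each apparent later pass is simply the next depth level being drained from the queue.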

When retrieving an ftp url recursively, Wget will retrieve all the data from the given directory tree (including the subdirectories up to the specified depth) on the remote server, creating its mirror image locally. ftp retrieval is also limited by the depth parameter. Unlike http recursion, ftp recursion is performed depth-first.
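For contrast, depth-first order (as used for ftp) is easy to demonstrate with a local directory walk. Treat this purely as an analogy: os.scandir and print stand in for the remote listing and download, and the sorting is only there to make the demo deterministic, not a claim about how wget orders ftp listings.

import os

def walk_depth_first(path, depth=0, max_depth=5):
    # Fully descends into each subdirectory before moving on to its
    # siblings, unlike the breadth-first http case above.
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_dir() and depth < max_depth:
            walk_depth_first(entry.path, depth + 1, max_depth)
        else:
            print(f"depth {depth}: {entry.path}")  # a download, in wget's case

walk_depth_first(".")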

By default, Wget will create a local directory tree, corresponding to the one found on the remote server.

.... snip ....

Recursive retrieval should be used with care. Don't say you were not warned.


From my own very basic testing, when the structure depth is 1 it goes in order of appearance, from the top of the page to the bottom. (Note in the run below that style.css, referenced in the <head>, is fetched first, then c.htm, a.htm, and b.htm in source order.)

[ 16:28 root@host /var/www/html ]# cat index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en-US">
    <head>
        <link rel="stylesheet" type="text/css" href="style.css">
    </head>
    <body>
        <div style="text-align:center;">
            <h2>Mobile Test Page</h2>
        </div>
        <a href="/c.htm">c</a>
        <a href="/a.htm">a</a>
        <a href="/b.htm">b</a>
    </body>
</html>



[ 16:28 jon@host ~ ]$ wget -m http://98.164.214.224:8000
--2011-10-15 16:28:51--  http://98.164.214.224:8000/
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 556 [text/html]
Saving to: "98.164.214.224:8000/index.html"

100%[====================================================================================================================================================================================================>] 556         --.-K/s   in 0s

2011-10-15 16:28:51 (19.7 MB/s) - "98.164.214.224:8000/index.html" saved [556/556]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 221 [text/css]
Saving to: "98.164.214.224:8000/style.css"

100%[====================================================================================================================================================================================================>] 221         --.-K/s   in 0s

2011-10-15 16:28:51 (777 KB/s) - "98.164.214.224:8000/style.css" saved [221/221]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]
Saving to: "98.164.214.224:8000/c.htm"

    [ <=>                                                                                                                                                                                                 ] 0           --.-K/s   in 0s

2011-10-15 16:28:51 (0.00 B/s) - "98.164.214.224:8000/c.htm" saved [0/0]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/a.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/a.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (102 KB/s) - "98.164.214.224:8000/a.htm" saved [2/2]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/b.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/b.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (85.8 KB/s) - "98.164.214.224:8000/b.htm" saved [2/2]

FINISHED --2011-10-15 16:28:51--
Downloaded: 5 files, 781 in 0s (2.15 MB/s)
