
Python - urlretrieve for entire web page

With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index page, of page.com. Does urlretrieve handle something similar to wget -r that lets me download the entire web page structure, with all related HTML files of page.com?

Regards

Not directly.

If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

This will let you load a page and follow links from it.

Something like:

import mechanize

br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    response = br.follow_link(link)  # fetch the linked page
    html = response.read()
    # save your downloaded page here, e.g. write html to a file
    br.back()  # return to the start page before following the next link

As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though, by keeping a queue of links still to visit and a set of pages already seen, as in the sketch below.
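For example, here is one way that adaptation could look. This is only a sketch: crawl, start, and max_pages are made-up names, it restricts itself to links on the starting host, and it assumes a mechanize release that runs on your Python version (on Python 2, use "from urlparse import urlparse" instead).

import mechanize
from urllib.parse import urlparse

def crawl(start, max_pages=100):
    # Breadth-first crawl, limited to pages on the same host as 'start'.
    br = mechanize.Browser()
    visited = set()
    queue = [start]
    host = urlparse(start).netloc
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            response = br.open(url)
            html = response.read()
            # save html here, e.g. write it to a file named after the URL
            for link in br.links():
                if urlparse(link.absolute_url).netloc == host:
                    queue.append(link.absolute_url)
        except Exception:
            continue  # skip pages that fail to load or are not HTML
    return visited

Note that mechanize honours robots.txt by default, so some pages may raise an error and simply be skipped here.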

If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need to do some kind of clever processing (handling JavaScript, selectively following links, etc.).
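If you do end up using wget but want to kick it off from Python, shelling out is enough. A minimal sketch, assuming wget is installed and on your PATH (the flags are standard wget options):

import subprocess

# Mirror page.com recursively; requires wget on the PATH and Python 3.5+.
# -r: recursive download, -k: convert links for local browsing,
# -p: also download page requisites such as images and CSS.
subprocess.run(['wget', '-r', '-k', '-p', 'http://page.com'], check=True)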
