
Python - urlretrieve for entire web page

With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index page, of page.com. Does urlretrieve handle something similar to wget -r that lets me download the entire web page structure, with all related HTML files of page.com?

Regards

Not directly.

If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

This will let you load a page and follow links from it.

Something like:

import mechanize

br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    response = br.follow_link(link)  # fetch the linked page
    html = response.read()
    # save your downloaded page here, e.g. write html to a file
    br.back()  # return to the start page before following the next link

As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though, by keeping a queue of links still to visit and a set of pages already seen, as in the sketch below.
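For example, here is one way that adaptation could look. This is only a sketch: crawl, start, and max_pages are made-up names, it restricts itself to links on the starting host, and it assumes a mechanize release that runs on your Python version (on Python 2, use "from urlparse import urlparse" instead).

import mechanize
from urllib.parse import urlparse

def crawl(start, max_pages=100):
    # Breadth-first crawl, limited to pages on the same host as 'start'.
    br = mechanize.Browser()
    visited = set()
    queue = [start]
    host = urlparse(start).netloc
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            response = br.open(url)
            html = response.read()
            # save html here, e.g. write it to a file named after the URL
            for link in br.links():
                if urlparse(link.absolute_url).netloc == host:
                    queue.append(link.absolute_url)
        except Exception:
            continue  # skip pages that fail to load or are not HTML
    return visited

Note that mechanize honours robots.txt by default, so some pages may raise an error and simply be skipped here.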

If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need to do some kind of clever processing (handling JavaScript, selectively following links, etc.).
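If you do end up using wget but want to kick it off from Python, shelling out is enough. A minimal sketch, assuming wget is installed and on your PATH (the flags are standard wget options):

import subprocess

# Mirror page.com recursively; requires wget on the PATH and Python 3.5+.
# -r: recursive download, -k: convert links for local browsing,
# -p: also download page requisites such as images and CSS.
subprocess.run(['wget', '-r', '-k', '-p', 'http://page.com'], check=True)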
