
Python - urlretrieve for entire web page

With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index page, of page.com. Does urlretrieve handle anything similar to wget -r, which lets me download the entire page structure of page.com with all of its related HTML files?

Regards

Not directly.

If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

This will let you load a page and follow links from it.

Something like:

import mechanize

br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    # follow the link, read the page it points to, then return to the index
    response = br.follow_link(link)
    html = response.read()
    # save your downloaded page here, e.g. write html to a file
    br.back()

As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though.
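
One way to adapt it is to keep a queue of discovered links and a set of visited URLs. Here is a minimal sketch of that idea, written for Python 2 to match the question's urllib.urlretrieve; the crawl_site function, the max_pages cap, and the same-domain check are illustrative choices of mine, not part of mechanize:

import urlparse   # Python 2 stdlib; use urllib.parse on Python 3
import mechanize

def crawl_site(start_url, max_pages=100):
    """Breadth-first crawl of one site; returns a {url: html} dict."""
    br = mechanize.Browser()
    domain = urlparse.urlparse(start_url).netloc
    visited = set()
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            response = br.open(url)
            pages[url] = response.read()
            links = list(br.links())
        except Exception:
            continue  # skip pages that fail to load or aren't HTML
        for link in links:
            absolute = urlparse.urljoin(url, link.url)
            # stay on the starting site, roughly like wget -np
            if urlparse.urlparse(absolute).netloc == domain:
                queue.append(absolute)
    return pages

pages = crawl_site('http://stackoverflow.com')

A real crawler would also normalize URLs (strip fragments, deduplicate query strings) and respect robots.txt, but the breadth-first loop above is the core of it.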

If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need to do some kind of clever processing (handling JavaScript, selectively following links, etc.).
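
For completeness, if you want to drive wget from a Python script rather than the shell, a minimal sketch follows; it assumes the wget binary is installed and on PATH, and the flag combination shown is one common mirroring setup, not the only one:

import subprocess

# -r: recurse into links, -np: never ascend to the parent directory,
# -k: rewrite links so the local copy is browsable offline,
# -p: also fetch page requisites such as images and CSS
subprocess.check_call(['wget', '-r', '-np', '-k', '-p', 'http://page.com'])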
