
Python - urlretrieve for entire web page

With urllib.urlretrieve('http://page.com', 'page.html') I can save the index page, and only the index page, of page.com. Does urlretrieve handle anything similar to wget -r, which lets me download the entire page structure of page.com with all of its related HTML files?

Regards

Not directly.

If you want to spider over an entire site, look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

This will let you load a page and follow links from it.

Something like:

import mechanize

br = mechanize.Browser()
br.open('http://stackoverflow.com')
for link in br.links():
    print(link)
    # follow the link, read the page it points to, then return to the index
    response = br.follow_link(link)
    html = response.read()
    # save your downloaded page here, e.g. write html to a file
    br.back()

As it stands, this will only get you the pages one link away from your starting point. You could easily adapt it to cover an entire site, though.
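
One way to adapt it is to keep a queue of discovered links and a set of visited URLs. Here is a minimal sketch of that idea, written for Python 2 to match the question's urllib.urlretrieve; the crawl_site function, the max_pages cap, and the same-domain check are illustrative choices of mine, not part of mechanize:

import urlparse   # Python 2 stdlib; use urllib.parse on Python 3
import mechanize

def crawl_site(start_url, max_pages=100):
    """Breadth-first crawl of one site; returns a {url: html} dict."""
    br = mechanize.Browser()
    domain = urlparse.urlparse(start_url).netloc
    visited = set()
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            response = br.open(url)
            pages[url] = response.read()
            links = list(br.links())
        except Exception:
            continue  # skip pages that fail to load or aren't HTML
        for link in links:
            absolute = urlparse.urljoin(url, link.url)
            # stay on the starting site, roughly like wget -np
            if urlparse.urlparse(absolute).netloc == domain:
                queue.append(absolute)
    return pages

pages = crawl_site('http://stackoverflow.com')

A real crawler would also normalize URLs (strip fragments, deduplicate query strings) and respect robots.txt, but the breadth-first loop above is the core of it.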

If you really just want to mirror an entire site, use wget. Doing this in Python is only worthwhile if you need to do some kind of clever processing (handling JavaScript, selectively following links, etc.).
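
For completeness, if you want to drive wget from a Python script rather than the shell, a minimal sketch follows; it assumes the wget binary is installed and on PATH, and the flag combination shown is one common mirroring setup, not the only one:

import subprocess

# -r: recurse into links, -np: never ascend to the parent directory,
# -k: rewrite links so the local copy is browsable offline,
# -p: also fetch page requisites such as images and CSS
subprocess.check_call(['wget', '-r', '-np', '-k', '-p', 'http://page.com'])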
