Downloading all links on a webpage using Mechanize in Python

I was following the thread below, which seemed to answer my question. It is a good example of how to download all the links on a webpage using Mechanize:

Download all the links(related documents) on a webpage using Python

I used the code that was posted there:

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object e.g. so it look like mozilla
#and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

f=open("source.html","w")
f.write(br.response().read()) #can be helpful for debugging maybe

filetypes=[".zip",".exe",".tar.gz"] #you will need to do some kind of pattern matching on your files
myfiles=[]
for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l): #check if this link has the file extension we want (you may choose to use reg expressions or something)
            myfiles.append(l)


def downloadlink(l):
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.

    br.click_link(l)
    f.write(br.response().read())
    print l.text," has been downloaded"
    #br.back()

for l in myfiles:
    sleep(1) #throttle so you dont hammer the site
    downloadlink(l)

I only changed:

f=open(l.text,"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

To:

f=open('C:\\l.text',"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

That made the code run for me; without the change it raised an error. When I run the modified code, I get the following output:

Download> xlwt-0.7.5.tar.gz has been downloaded 
xlwt-0.7.5.tar.gz has been downloaded

So it appears to have worked, but I have no idea where the file was downloaded to. Any ideas? I have searched my C drive and could not find it.

If the code is run as:

f=open(l.text,"w")

It raises the following exception:

Traceback (most recent call last):
  File "C:\Python27\mech.py", line 33, in <module>
    downloadlink(l)
  File "C:\Python27\mech.py", line 25, in downloadlink
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.
IOError: [Errno 22] invalid mode ('w') or filename: 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'

The Python code you quoted uses the text attribute of the link l (hence the expression l.text) as the filename. Since each link should normally have a different text attribute, the code should produce a number of files, one for each link.

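Your change replaces a variable expression (one which has a different value for each link) with a constant string. So every download is written to the same file, which is literally named l.text and sits in the root of the C: drive (the string 'C:\\l.text' is the path C:\l.text). Each download overwrites the previous one, so when you look at this file you should see the contents of the last matching link on the page.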
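To make the difference concrete, here is a minimal sketch (written for Python 3) contrasting the constant path with one built from a variable. The link_texts list and the target_dir folder name are made up for illustration, and nothing here touches the network or the disk.

```python
import os

# Hypothetical link texts standing in for l.text on three different links.
link_texts = ["a.zip", "b.tar.gz", "c.exe"]
target_dir = "downloads"  # assumed output folder name

# Constant: the characters "l.text" are part of the string itself,
# so every iteration produces exactly the same path.
constant_paths = ['C:\\l.text' for _ in link_texts]

# Variable: the value of each link text is used, giving one path per link.
variable_paths = [os.path.join(target_dir, text) for text in link_texts]

print(len(set(constant_paths)))   # number of distinct constant paths
print(variable_paths)
```

When you do open the real files, "wb" is probably a safer mode than "w" for archives such as .tar.gz, because text mode on Windows can alter binary data.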

(By the way, not your fault I know, but l is a very bad name for a variable due to its potential for confusion with the digit one).

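The correct way to run this program is inside an empty directory (otherwise the individual files will be hard to track down) on which you have write permission. The original code failed for you because the link text contains characters such as > and <, which Windows does not allow in filenames; that is what the IOError is reporting. If any of the filenames contain such characters, or slashes, then you will have to take special pains to either create the necessary directory structure or transform them somehow into acceptable Windows filenames.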
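One possible way to do that transformation is sketched below (written for Python 3). The helper name safe_filename and its default argument are my own invention, not part of mechanize, and the rules it applies are only a rough approximation of what Windows accepts.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Anything that looks like an HTML tag, e.g. <span style="...">.
_TAG = re.compile(r'<[^>]*>')

def safe_filename(link_text, default="download"):
    """Turn arbitrary link text into a name usable as a Windows file name."""
    name = _TAG.sub("", link_text)   # drop markup such as <span ...>
    name = _INVALID.sub("_", name)   # replace any remaining illegal characters
    name = name.strip(" .")          # Windows rejects trailing dots and spaces
    return name or default           # fall back if nothing is left

text = 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'
print(safe_filename(text))
```

In downloadlink you could then call open(safe_filename(l.text), "wb") instead of open(l.text, "w"). This sketch does not handle reserved device names such as CON or NUL, nor two links that end up with the same name.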

You may also wish to replace the detection code with something a little more idiomatic:

for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    # test the link's URL: str(l) describes the whole Link object, so it does not end with the extension
    if any(l.url.endswith(t) for t in filetypes):
        myfiles.append(l)
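An endswith test can still miss a link whose URL carries a query string or a fragment after the file name. A minimal sketch of a check that looks only at the path part of the URL is given below (written for Python 3). The function name wanted and the example.com URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit

filetypes = (".zip", ".exe", ".tar.gz")

def wanted(url, extensions=filetypes):
    """Return True if the path part of url ends with one of the extensions."""
    path = urlsplit(url).path                  # excludes ?query and #fragment
    return path.lower().endswith(extensions)   # str.endswith accepts a tuple

# Made-up URLs for illustration only.
urls = [
    "http://example.com/files/xlwt-0.7.5.tar.gz#md5=abc123",
    "http://example.com/files/setup.EXE?mirror=1",
    "http://example.com/docs/zip-howto.html",
]
print([u for u in urls if wanted(u)])
```

With mechanize you would pass each link's URL to wanted() inside the loop over br.links().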
