
Downloading all links on a webpage using Mechanize in Python

I was trying to follow the thread below, which seemed to answer my question. It is a good example of how to download all links on a webpage using Mechanize:

Download all the links (related documents) on a webpage using Python

I used the code that was posted there:

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object e.g. so it look like mozilla
#and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

f=open("source.html","w")
f.write(br.response().read()) #can be helpful for debugging maybe

filetypes=[".zip",".exe",".tar.gz"] #you will need to do some kind of pattern matching on your files
myfiles=[]
for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l): #check if this link has the file extension we want (you may choose to use reg expressions or something)
            myfiles.append(l)


def downloadlink(l):
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.

    br.click_link(l)
    f.write(br.response().read())
    print l.text," has been downloaded"
    #br.back()

for l in myfiles:
    sleep(1) #throttle so you dont hammer the site
    downloadlink(l)

I only changed:

f=open(l.text,"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

to:

f=open('C:\\l.text',"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

With that change the code ran; without it, it raised an error. When I run the code, I get the following output:

Download> xlwt-0.7.5.tar.gz has been downloaded 
xlwt-0.7.5.tar.gz has been downloaded

So it worked, but I have no idea where the file was downloaded to. Any ideas? I have searched my C drive and could not find it.

If the code is run as:

f=open(l.text,"w")

it raises the following exception:

Traceback (most recent call last):
  File "C:\Python27\mech.py", line 33, in <module>
    downloadlink(l)
  File "C:\Python27\mech.py", line 25, in downloadlink
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.
IOError: [Errno 22] invalid mode ('w') or filename: 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'

The Python code you quoted uses the text attribute of the link l (hence the expression l.text) as the filename. Consequently, since each link should have a different text attribute value, the code should produce a number of files, one for each link.

Your change replaces a variable expression (one which has a different value for each link) with a constant string. The literal 'C:\\l.text' names a single file called l.text in the root of the C: drive (the doubled backslash is just an escaped backslash), so every download is written to C:\l.text, each one overwriting the previous. When you look at that file you should see the contents of the last link that was downloaded.
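To see the difference concretely, here is a minimal sketch that uses plain strings in place of mechanize link objects. The names link_texts and target_dir are made up for illustration, and a temporary directory stands in for C:\. Writing to a constant path leaves one file holding only the last contents, while building the path from the variable gives one file per link.

```python
import os
import tempfile

# Hypothetical stand-ins for the text of each link found on the page.
link_texts = ["a.zip", "b.tar.gz", "c.exe"]

# A scratch directory, so the sketch does not touch C:\ or the current directory.
target_dir = tempfile.mkdtemp()

# Constant filename: the string literal "l.text" is not the variable l.text,
# so every iteration overwrites the same file.
constant_path = os.path.join(target_dir, "l.text")
for text in link_texts:
    with open(constant_path, "w") as f:
        f.write("contents of " + text)

# Variable filename: one file per link.
for text in link_texts:
    with open(os.path.join(target_dir, text), "w") as f:
        f.write("contents of " + text)

# The constant-path file only holds what the last iteration wrote.
with open(constant_path) as f:
    last_written = f.read()

created = sorted(os.listdir(target_dir))
print(last_written)
print(created)
```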

(By the way, I know it is not your fault, but l is a very bad name for a variable because it is easily confused with the digit one.)

The right way to run this program is from an empty directory on which you have write permission (otherwise the individual files will be hard to track down). If any of the filenames contain slashes, or other characters that Windows does not allow in filenames such as the < and > in the traceback above, you will have to either create the necessary directory structure or transform the names into acceptable Windows filenames.
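One possible way to do that transformation is sketched below. safe_filename is a hypothetical helper, not part of mechanize, and the rules it applies (dropping anything that looks like an HTML tag, replacing the characters Windows forbids with underscores) are just one reasonable choice.

```python
import re

def safe_filename(link_text):
    """Hypothetical helper: turn link text into a name Windows will accept."""
    # Drop anything that looks like an HTML tag, e.g. <span style="...">.
    name = re.sub(r"<[^>]*>", "", link_text)
    # Replace the characters Windows forbids in filenames: < > : " / \ | ? *
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots.
    name = re.sub(r"\s+", " ", name).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return name or "download"

# The link text from the traceback in the question.
raw = 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'
print(safe_filename(raw))
```

With a helper like this, the open call in downloadlink could become open(safe_filename(l.text), "wb"). Binary mode matters on Windows, where text mode would translate newline bytes inside an archive.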

You may also wish to replace the detection code with something a little more idiomatic:

for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    s = l.url #test the URL itself; str(l) is the Link's repr and ends with ")", so endswith would never match
    if any(s.endswith(t) for t in filetypes):
        myfiles.append(l)
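A related refinement, sketched here without mechanize, is to match the extension against the path component of each URL, so that a query string or fragment after the filename does not hide it. The helper name has_wanted_extension and the example.com URLs are assumptions made for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

filetypes = (".zip", ".exe", ".tar.gz")

def has_wanted_extension(url, extensions=filetypes):
    """Hypothetical helper: True if the URL's path ends with a wanted extension."""
    path = urlparse(url).path  # the path excludes the query string and fragment
    return path.lower().endswith(extensions)  # str.endswith accepts a tuple

# Made-up URLs standing in for the url attribute of each link.
urls = [
    "http://example.com/files/xlwt-0.7.5.tar.gz#md5=abc",
    "http://example.com/docs/index.html",
    "http://example.com/setup.EXE?mirror=1",
]
wanted = [u for u in urls if has_wanted_extension(u)]
print(wanted)
```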

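Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.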

 