
Trying to download images from a website with urllib.request, but I'm getting a 403 error. How do I change the user agent?

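I wrote a program that goes to a specific website and finds the image URL, the title, and the URL of the "previous" button. This is done with the requests and bs4 modules. My URLs are correct, but nothing seems to download. I keep getting a 403 error along with a few other exceptions. I know a 403 often happens because the site detects that you are not using a browser, so I searched online for how to change the user agent. I added some code to my program to do this, but I am not sure I did it correctly, because many of the Stack Exchange tutorials and posts on this topic use urllib2 or plain urllib, which have since been merged into urllib.request.

Here is the code: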

import bs4, requests, urllib.request, re, os

os.chdir(r'c:\Users\Adam\Desktop\pythonprogs\GuitarPics')   #change this to your current directory if needed

#this first chunk finds the comic number based on the html code from the page
url = 'http://www.guitargeek.com/michael-wilton-queensryche-guitar-rig-and-gear-setup-2007/'
res = requests.get(url)     #open initial  webpage
res.raise_for_status()      #raise exception if page doesnt work          

imageSoup = bs4.BeautifulSoup(res.text, 'html.parser')  #parse html from page
search = True

#the comic number is then used to loop through the pages and extract the image and title and then save it to a folder.

while search == True:


    res = requests.get(url)          #open initial xkcd webpage
    res.raise_for_status()
    imageSoup = bs4.BeautifulSoup(res.text, 'html.parser')  #parse html from page

    try:

        prevElem = imageSoup.select('#wrapper_main > div.rigview_nav_middle > div > div > span.pref > a')
        url = prevElem[0].get('href')

        imageElem = imageSoup.select('#entry > p > a > img')                      #get image element from html
        imageAttrs = imageElem[0].attrs
        imageURL = imageAttrs['src']

        titleElem = imageSoup.select('#content > div > h1')                 #finds comic title element
        title = titleElem[0].text.strip()                     #strips title element to just be title



    except:
        print('Not able to find image source')
        search = False #Sometimes this problem happens when there's no image on the page
        print(url)
        print(imageURL)
        continue

    if os.path.isfile(os.path.basename(imageURL)) == False:                 #if the image does not exist in folder, download it

        #try:
            class AppURLopener(urllib.request.FancyURLopener):
                version = "Mozilla/5.0"

            urllib._urlopener = AppURLopener()

            #Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
            urllib.request.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'
            resource = urllib.request.urlopen(imageURL)
            output = open(os.path.basename(imageURL),"wb")
            output.write(resource.read())
            output.close()
            print('Image ' + title + ' downloaded')

        #except:
            print('Failed to download this one, not an image?')     #Sometimes the file isn't an image and urllib fails to download it
            continue
    else:
        print('You already have this image (' + title + ')')                        


print('Finished.' + ' All images were downloaded to: ' + os.getcwd())

I am specifically unsure about this part:

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()

#Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
urllib.request.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'
resource = urllib.request.urlopen(imageURL)
output = open(os.path.basename(imageURL),"wb")
output.write(resource.read())
output.close()

So, how do I successfully change the user agent with urllib.request?

For anyone who runs into the same problem in the future, here is what I did to solve it.

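from urllib.request import FancyURLopener

# Note: FancyURLopener has been deprecated since Python 3.3.
class MyOpener(FancyURLopener):
    # The opener sends this class attribute as its User-Agent header.
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/2007127 Firefox/2.0.0.11'

myopener = MyOpener()
# imageURL and title come from the scraping loop in the question.
myopener.retrieve(imageURL, title + '.jpg')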
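
As an alternative to the deprecated FancyURLopener, the header can also be set on a urllib.request.Request object, which is what urllib.request.urlopen actually reads. Setting URLopener.version has no effect on urlopen, which by default sends a "Python-urllib/x.y" user agent. The following is only a minimal sketch: the helper names build_request and download_image and the DEFAULT_USER_AGENT string are made up for illustration, and a given site may still refuse the request if it checks more than the User-Agent header.

```python
import shutil
import urllib.request

# Assumed value: any browser-like User-Agent string, not one required by the site.
DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_image(url, dest_path, user_agent=DEFAULT_USER_AGENT):
    """Stream the response body for url into the file at dest_path."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

In the loop from the question, the urlopen/open/write/close lines could then be replaced by a single call such as download_image(imageURL, os.path.basename(imageURL)).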

