
Web Scraping Error (HTTP Error 403: Forbidden)

I am trying to write a simple program that collects all the image addresses on a web page and then downloads the images into a folder. The problem is that I get a 403 error. I have been trying to fix it for over an hour and desperately need help. Here is my code:

import urllib.request
import requests
from bs4 import BeautifulSoup



url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url)
code = BeautifulSoup(data.text, 'html.parser')




photos = []

def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'
    urllib.request.urlretrieve(url, fullPath)

for img in code.find('div', id='_imageList'):
    pic = str(img)[43:147]
    photos.append(str(pic))

for photo in photos:
    if photo == '':
        photos.remove(photo)

for photo in photos[0:-4]:
    dl_jpg(photo, 'images/', 'img')

Websites often block requests that do not send a User-Agent header. I updated your code to send one with every request. I also chose to use only the requests library and drop urllib. urllib does support custom headers, but you were already using requests and I am more familiar with it.
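For anyone who would rather stay with urllib, here is a minimal sketch of attaching the header there. urllib.request.urlretrieve takes no headers argument, so the sketch builds a urllib.request.Request and passes it to urlopen instead. The names build_request and download and the USER_AGENT value are placeholders of mine, not part of the original code.

```python
import urllib.request

# Placeholder value (an assumption); any realistic browser string can be used.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url: str, full_path: str) -> None:
    """Fetch url with the custom header and write the response body to full_path.

    urlopen raises urllib.error.HTTPError for responses such as 403.
    """
    with urllib.request.urlopen(build_request(url)) as response:
        data = response.read()
    with open(full_path, "wb") as f:
        f.write(data)
```

Whether a given site accepts the request still depends on the site; some also check other headers.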

I also suggest adding a delay (sleep) between requests; 30 to 45 seconds is a reasonable amount. This avoids flooding the website with requests, which can amount to an unintentional denial of service. Some sites will also block you if you send too many requests too quickly.
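The delay could be added along these lines. This is only a sketch: fetch_politely and its parameters are names I made up, and the fetch callable stands in for whatever function actually performs the request.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_delay: float = 30.0,
    max_delay: float = 45.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With requests, the fetch argument could be something like lambda u: requests.get(u, headers={'User-Agent': user_agent}).content. The sleep parameter exists only so the pause can be replaced in tests.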

import requests
from bs4 import BeautifulSoup

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url, headers={'User-Agent': user_agent})
code = BeautifulSoup(data.text, 'html.parser')

photos = []

def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'

    # make request with user-agent. If request is successful then save the result.
    image_request = requests.get(url, headers={'User-Agent': user_agent})
    if image_request.status_code == 200:
        with open(fullPath, 'wb') as f:
            f.write(image_request.content)

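# Each child of the image list div is converted to a string and sliced to
# pull out the image URL. Whitespace-only children produce an empty string.
for img in code.find('div', id='_imageList'):
    pic = str(img)[43:147]
    photos.append(pic)

# Build a new list instead of removing items from the list while iterating over it.
photos = [photo for photo in photos if photo != '']

# Number the files so that each download does not overwrite the previous one.
for index, photo in enumerate(photos[0:-4]):
    dl_jpg(photo, 'images/', 'img' + str(index))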
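Slicing str(img) at fixed offsets breaks as soon as the markup changes. A less fragile option is to read the URL from a tag attribute. The sketch below assumes the URL is stored in an attribute called data-url on each img tag; that name is my assumption and should be checked against the real page source. The sample HTML and the function extract_image_urls are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the assumed page structure.
SAMPLE_HTML = """
<div id="_imageList">
  <img alt="image" class="_images" data-url="https://example.com/1.jpg">
  <img alt="image" class="_images" data-url="https://example.com/2.jpg">
  <img alt="image" class="_images">
</div>
"""


def extract_image_urls(html, container_id="_imageList", attribute="data-url"):
    """Return the attribute value of every img tag inside the container div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        # The page has no such container, so there is nothing to extract.
        return []
    # find_all returns only tags, so whitespace between them is skipped;
    # img tags without the attribute are left out.
    return [
        img[attribute]
        for img in container.find_all("img")
        if img.has_attr(attribute)
    ]
```

Calling extract_image_urls(data.text) would then replace both the slicing loop and the empty-string cleanup.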
