
urlopen of urllib.request cannot open a page in python 3.7

I want to write a web scraper that collects article titles from the Medium.com homepage.

I am trying to write a Python script that scrapes headlines from the Medium.com website. I am using Python 3.7 and imported urlopen from urllib.request, but it cannot open the site and raises this error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

from bs4 import BeautifulSoup
from urllib.request import urlopen

webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read(), "html.parser")

Result: urllib.error.HTTPError: HTTP Error 403: Forbidden

I expected it to read the page without raising any error.
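To see what the server actually sent back instead of letting the script crash, the error can be caught and inspected. The following is only a minimal sketch: `fetch_status` is a hypothetical helper name, and the 5-second timeout is an assumption borrowed from the requests call in the question.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return (status_code, reason) for url without raising on HTTP errors.

    Hypothetical helper for illustration; not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.reason
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and reason phrase of the reply.
        return err.code, err.reason
```

Calling `fetch_status("https://medium.com/")` in the situation described above would be expected to return the 403 status and its reason rather than raising.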

However, this does not happen when I use the requests module:

import requests 
from bs4 import BeautifulSoup 
url = 'https://medium.com/' 
response = requests.get(url, timeout=5)

This time it works without any error.

Why?

urllib is a fairly old and minimal module. For web scraping, the requests module is generally recommended. You can check out this answer for additional information.

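Many sites nowadays inspect the User-Agent header of incoming requests to try to deter bots. By default, urllib identifies itself as Python-urllib/3.x, which such sites often reject with a 403, whereas requests sends a different default User-Agent (python-requests/x.y) that evidently is not blocked here. requests is the better module to use, but if you really want to use urllib, you can change the User-Agent header to pretend to be Firefox or another browser so that the request is not blocked. A quick example can be found here: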

https://stackoverflow.com/a/16187955

import urllib.request

user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'

url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)

You will need to replace the placeholders in the user_agent string (platform, geckoversion, and so on) with real values. Hope this helps.
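To make the difference concrete, the default header that urllib attaches can be inspected without sending any request. This is a small sketch; the `BROWSER_UA` value is only an illustrative Firefox-style string chosen for this example, not something taken from the question, and there is no guarantee that any particular site accepts it.

```python
import urllib.request

# An opener built with defaults carries the headers urlopen() would send.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])  # something like "Python-urllib/3.7"

# Illustrative browser-style value (an assumption for this example).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Headers can also be passed directly when constructing the Request.
request = urllib.request.Request(
    "https://medium.com/", headers={"User-Agent": BROWSER_UA}
)

# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Nothing is fetched here; passing `request` to `urllib.request.urlopen` would send it with the overridden header.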

This worked for me:

from urllib.request import urlopen

MY_URL = "http://example.com"  # placeholder; replace with the page you want to fetch
html = urlopen(MY_URL)
contents = html.read()
print(contents)
