I'm trying to write a scraper in Python to get some information from a page, such as the titles of the offers that appear on this page:
https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585
So far I am using this code:
import bs4
import requests

def extract_source(url):
    # Download the page and return its HTML as text
    source = requests.get(url).text
    return source

def extract_data(source):
    soup = bs4.BeautifulSoup(source)
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
But when I execute this code, it prints this instead of the page title:
<title> Access Denied</title>
What can I do to solve this?
As was mentioned in the comments, you need to specify an allowed User-Agent and pass it in the request headers:
def extract_source(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    source = requests.get(url, headers=headers).text
    return source
def extract_source(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=headers).text
    return source
Output:
<title>Saree Retailers in Panipat - Best Deals online - Justdial</title>
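To see why the header matters, it can help to compare what requests sends by default with what it sends once a User-Agent is set. The following is only a minimal sketch: the URL is a hypothetical placeholder, and the request is prepared but never sent, so it just shows the headers that would go out.

```python
import requests

# Hypothetical URL used only for illustration; nothing is sent over the network.
URL = "https://example.com/listing"

# The User-Agent requests uses when none is supplied, e.g. "python-requests/2.x".
default_agent = requests.utils.default_user_agent()

# Headers set on a session are applied to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Build the request without sending it, so the final headers can be inspected.
prepared = session.prepare_request(requests.Request("GET", URL))
sent_agent = prepared.headers["User-Agent"]

print("default:", default_agent)
print("sent:   ", sent_agent)
```

Using a session is just one option; passing headers= to each requests.get call, as in the answers above, has the same effect for a single request.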
Add a User-Agent header to your request. Some sites do not respond to requests that do not have a User-Agent.
Try this:
import bs4
import requests

def extract_source(url):
    # Send a browser-like User-Agent so the site does not reject the request
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

def extract_data(source):
    # Name the parser explicitly instead of letting BeautifulSoup pick one
    soup = bs4.BeautifulSoup(source, 'lxml')
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
I added 'lxml' as the parser argument to avoid a potential parse error.
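If lxml is not installed, BeautifulSoup can also use Python's built-in 'html.parser'. As a rough sketch, the title lookup can be tried on static HTML strings without touching the network. The helper name page_title and the sample strings are made up for illustration, with the titles taken from the outputs shown in this thread.

```python
import bs4

def page_title(html):
    """Return the stripped <title> text, or None if the page has no <title>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)

# Sample documents standing in for a blocked and a successful response.
blocked = "<html><head><title> Access Denied</title></head><body></body></html>"
allowed = ("<html><head><title>Saree Retailers in Panipat - Best Deals online"
           " - Justdial</title></head><body></body></html>")

print(page_title(blocked))
print(page_title(allowed))
```

A check like this makes it easy to notice when the site has returned its block page rather than the listing.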