Beautiful Soup scraper gives "Access Denied" even though user-agent string is specified
Scraper in Python gives "Access Denied"
I am trying to write a scraper in Python to get some information from a page, such as the titles of the offers shown on this page:
https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585
So far I am using this code:
import bs4
import requests

def extract_source(url):
    # Download the page and return its HTML as text
    source = requests.get(url).text
    return source

def extract_data(source):
    # Parse the HTML and print every <title> tag
    soup = bs4.BeautifulSoup(source)
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
But when I run this code, it gives me an error:
<title>Access Denied</title>
What can I do to solve this?
As mentioned in the comments, you need to specify an allowed user agent and pass it as headers:
# Variant with a full browser User-Agent string
def extract_source(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    source = requests.get(url, headers=headers).text
    return source

# Variant with a minimal User-Agent string
def extract_source(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=headers).text
    return source
Output:
<title>Saree Retailers in Panipat - Best Deals online - Justdial</title>
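A blocked request is easier to spot if the HTTP status is checked before the HTML is handed to the parser. The following is only a sketch of that idea: the `timeout` value, the `DEFAULT_HEADERS` name, and the call to `raise_for_status()` are assumptions added here, not part of the answer above, and the thread does not confirm which status code this site sends with its "Access Denied" page.

```python
import requests

# Assumed default; any User-Agent the site accepts could be used instead.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}

def extract_source(url, headers=None, timeout=10):
    """Fetch a page with a User-Agent header and fail loudly on HTTP errors."""
    response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text
```

With this variant, a rejected request would surface as an exception (if the site answers with an error status) rather than as an unexpected `<title>` in the parsed output.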
Add a User-Agent to your request; some sites do not respond to requests that lack a User-Agent.
Try this:
import bs4
import requests

def extract_source(url):
    # Send a User-Agent header so the site does not reject the request
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

def extract_data(source):
    # Parse with the lxml parser and print every <title> tag
    soup = bs4.BeautifulSoup(source, 'lxml')
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
I added 'lxml' as the parser to avoid a parsing error. Note that this requires the lxml package to be installed.
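If lxml is not available, Beautiful Soup can also use Python's built-in `html.parser`. The sketch below assumes that fallback and returns the title text instead of printing the tags; the function name `extract_titles` and its `parser` parameter are hypothetical and do not appear in the answers above.

```python
import bs4

def extract_titles(source, parser="html.parser"):
    """Return the text of every <title> tag in the given HTML string."""
    # Naming the parser explicitly keeps the result consistent across machines.
    soup = bs4.BeautifulSoup(source, parser)
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]
```

For example, `extract_titles("<title> Access Denied</title>")` yields a one-element list containing `"Access Denied"`, which makes it simple to test for a blocked response in code.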