
Scraper in Python gives "Access Denied"

I'm trying to write a scraper in Python to get some info from a page, such as the titles of the offers that appear on this page:
https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585

For now I'm using this code:

import bs4
import requests

def extract_source(url):
    source = requests.get(url).text
    return source

def extract_data(source):
    soup = bs4.BeautifulSoup(source)
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))

But when I execute this code, it gives me an error:

<title>Access Denied</title>

What can I do to solve this?

As was mentioned in the comments, you need to specify an acceptable User-Agent and pass it in headers:

def extract_source(url):
    # Any browser-like User-Agent works here; even the minimal "Mozilla/5.0" is enough
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    source = requests.get(url, headers=headers).text
    return source

Output:

<title>Saree Retailers in Panipat - Best Deals online - Justdial</title>
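
If you plan to make more than one request, a requests.Session lets you set the header once for every call. A minimal sketch along the same lines (the raise_for_status check is an extra safeguard, not part of the answer above):

import requests

# Set the User-Agent once on a Session; it is then sent with every request
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

response = session.get('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585')
response.raise_for_status()  # raises requests.HTTPError if the site still blocks the request
print(response.text[:200])   # first 200 characters of the page source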

Add a User-Agent header to your request; some sites do not respond to requests that lack one.

Try this:

import bs4
import requests

def extract_source(url):
    # The User-Agent header is what gets us past the "Access Denied" response
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source

def extract_data(source):
    # Pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning
    soup = bs4.BeautifulSoup(source, 'lxml')
    names = soup.findAll('title')
    for i in names:
        print(i)

extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))

I added 'lxml' as the parser to avoid a potential parse error.
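
Since the goal was the offer titles rather than just the page <title>, here is a hedged sketch of the next step. soup.title and get_text() are standard BeautifulSoup calls, but the 'store-name' class used for the individual listings is a guess, not confirmed from the page; inspect the page's actual markup to find the right selector:

import bs4
import requests

def extract_titles(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=headers).text
    soup = bs4.BeautifulSoup(source, 'lxml')
    # The page <title>, as plain text rather than a tag object
    print(soup.title.get_text())
    # Hypothetical selector: 'store-name' is an assumed class name,
    # check the page in your browser's inspector for the real one
    for name in soup.find_all('span', class_='store-name'):
        print(name.get_text(strip=True))

extract_titles('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585')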
