
Web scraping: HTTPError: HTTP Error 403: Forbidden, python3

Hi, I need to scrape a web page and extract the data-id attribute using a regular expression.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])

When I run my program in Jupyter, I get:

HTTPError: HTTP Error 403: Forbidden

It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl, and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
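
If you want to perform that robots.txt check from Python yourself, a minimal sketch along these lines should do. (Note that the standard urllib.robotparser would hit the same 403 on this site, since it fetches the file with urllib's default User-Agent.)

from urllib.request import urlopen, Request
from urllib.error import HTTPError

# Fetch robots.txt with a browser-like User-Agent; a 404 here suggests
# the site publishes no crawling rules at all.
request = Request(
    'https://clarity-project.info/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) '
                           'Gecko/20100101 Firefox/55.0'})
try:
    print(urlopen(request).read().decode())
except HTTPError as e:
    print('No robots.txt served:', e.code)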

As for the code, simply changing the User Agent string to something they like better seems to work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request

# Send a browser-like User-Agent; urllib's default one is rejected with 403.
request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})

html = urlopen(request).read().decode()

(Unrelated: there is another mistake in your code: you assign to bsObj but then call bsObg.)

EDIT: added the code below to answer an additional question from the comments:

What you seem to need is to find the value of the data-id attribute, no matter which tag it belongs to. The code below does just that:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')

request = Request(url, headers={'User-Agent': agent})

html = urlopen(request).read().decode()

soup = BeautifulSoup(html, 'html.parser')

# Match any tag that has a data-id attribute, regardless of the tag name.
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])

The key is simply to use a lambda expression as the parameter to BeautifulSoup's findAll function.
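
(An aside, not part of the original answer: BeautifulSoup can express the same filter declaratively. Because data-id contains a hyphen it cannot be passed as a keyword argument, but the attrs dictionary accepts it, and the value True means "the attribute must be present":)

# Equivalent to the lambda above: True matches any tag carrying data-id.
tags = soup.findAll(attrs={'data-id': True})
for tag in tags:
    print(tag['data-id'])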

The server is likely blocking your requests because of the default user agent. You can change this so that you appear to the server to be a web browser. For example, a Chrome User-Agent is:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36 

To add a User-Agent, you can create a request object with the url as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.

See:

import urllib.request
# Pass a browser-like User-Agent so the request is not rejected with 403.
r = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r).read()

You could try this:

#!/usr/bin/env python

from bs4 import BeautifulSoup
import requests

url = 'your url here'
# requests also sends a non-browser User-Agent by default, so reuse a
# browser-like one in case the server blocks it as well.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))

This should work!
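
(A further sketch, not from the original answers: since the question set out to extract data-id with a regular expression, that route can also work on the raw HTML, assuming the attribute values are double-quoted hex strings as the question's pattern suggests.)

import re
from urllib.request import urlopen, Request

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'
html = urlopen(Request(url, headers={'User-Agent': agent})).read().decode()

# Capture the quoted value of every data-id attribute in the raw markup.
for data_id in re.findall(r'data-id="([0-9a-f]+)"', html):
    print(data_id)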
