
Web scraping: HTTPError: HTTP Error 403: Forbidden, python3

Hi, I need to scrape a web page and extract the data-id attribute using a regular expression.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])

When I run my program in Jupyter, I get:

HTTPError: HTTP Error 403: Forbidden

It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl, and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
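
If you want to perform that robots.txt check from Python yourself, a minimal sketch along these lines should do. (Note that the standard urllib.robotparser would hit the same 403 on this site, since it fetches the file with urllib's default User-Agent.)

from urllib.request import urlopen, Request
from urllib.error import HTTPError

# Fetch robots.txt with a browser-like User-Agent; a 404 here suggests
# the site publishes no crawling rules at all.
request = Request(
    'https://clarity-project.info/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) '
                           'Gecko/20100101 Firefox/55.0'})
try:
    print(urlopen(request).read().decode())
except HTTPError as e:
    print('No robots.txt served:', e.code)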

As for the code, simply changing the User Agent string to something they like better seems to work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request

# Send a browser-like User-Agent; urllib's default one is rejected with 403.
request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})

html = urlopen(request).read().decode()

(Unrelated: there is another mistake in your code: you assign to bsObj but then call bsObg.)

EDIT: added the code below to answer an additional question from the comments:

What you seem to need is to find the value of the data-id attribute, no matter which tag it belongs to. The code below does just that:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')

request = Request(url, headers={'User-Agent': agent})

html = urlopen(request).read().decode()

soup = BeautifulSoup(html, 'html.parser')

# Match any tag that has a data-id attribute, regardless of the tag name.
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])

The key is simply to use a lambda expression as the parameter to BeautifulSoup's findAll function.
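
(An aside, not part of the original answer: BeautifulSoup can express the same filter declaratively. Because data-id contains a hyphen it cannot be passed as a keyword argument, but the attrs dictionary accepts it, and the value True means "the attribute must be present":)

# Equivalent to the lambda above: True matches any tag carrying data-id.
tags = soup.findAll(attrs={'data-id': True})
for tag in tags:
    print(tag['data-id'])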

The server is likely blocking your requests because of the default user agent. You can change this so that you appear to the server to be a web browser. For example, a Chrome User-Agent is:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36 

To add a User-Agent, you can create a request object with the url as a parameter and the User-Agent passed in a dictionary as the keyword argument 'headers'.

See:

import urllib.request
# Pass a browser-like User-Agent so the request is not rejected with 403.
r = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r).read()

You could try this:

#!/usr/bin/env python

from bs4 import BeautifulSoup
import requests

url = 'your url here'
# requests also sends a non-browser User-Agent by default, so reuse a
# browser-like one in case the server blocks it as well.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))

This should work!
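
(A further sketch, not from the original answers: since the question set out to extract data-id with a regular expression, that route can also work on the raw HTML, assuming the attribute values are double-quoted hex strings as the question's pattern suggests.)

import re
from urllib.request import urlopen, Request

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'
html = urlopen(Request(url, headers={'User-Agent': agent})).read().decode()

# Capture the quoted value of every data-id attribute in the raw markup.
for data_id in re.findall(r'data-id="([0-9a-f]+)"', html):
    print(data_id)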
