
Can't webscrape using Python and beautiful soup

I am trying to do some webscraping (for the Automate the Boring Stuff with Python Udemy course) but I keep getting an HTTPError: 403 Client Error: HTTP Forbidden for url: error. Here is the code I have been working with:

import bs4
import requests
ro = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/')
ro.raise_for_status()

And here's the error message I have been getting:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    ro.raise_for_status()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: HTTP Forbidden for url: https://www.carsales.com.au/cars/details/2012-mazda-3-neo-bl-series-2-auto/SSE-AD-6368302/

I have read online about changing the user agent, but I don't understand what that is or how to do it. Can anyone offer some help here? I am completely lost and I can't seem to get any webscraping working anywhere. I am on a Mac, if that helps at all. Thanks.

The requests package allows you to change your user agent, which makes the server think the request is coming from a regular browser:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
ro = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/', headers=headers)
ro.raise_for_status()

soup = BeautifulSoup(ro.text, 'html.parser')
print(soup.prettify())
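As a quick offline illustration of what BeautifulSoup gives you once the HTML is fetched (the snippet and the productTitle id below are made up for the example, not Amazon's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched product page
html = '<html><body><span id="productTitle">Automate the Boring Stuff</span></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
title = soup.find('span', id='productTitle')
print(title.get_text())  # Automate the Boring Stuff
```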

First, I would suggest replacing ro.raise_for_status() with checks on ro.status_code using if statements; if you do want to keep ro.raise_for_status(), wrap it in a try/except block. Regarding the error itself: Amazon appears to block requests that carry the requests module's default user agent. To get around this, change the user agent to something like: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36. For further information about implementing this, please check this page, "Using Python Requests" section.
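A minimal sketch of the try/except approach mentioned above. To keep it runnable without network access, it builds a requests Response object by hand instead of hitting Amazon; with a real request you would wrap requests.get(...) followed by raise_for_status() in the same way:

```python
import requests

def check(resp):
    """Return a short status string instead of letting HTTPError propagate."""
    try:
        resp.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    except requests.exceptions.HTTPError as err:
        return f'failed: {err.response.status_code}'
    return 'ok'

# Hand-built Response objects stand in for real requests.get() results
resp = requests.models.Response()
resp.status_code = 403
print(check(resp))  # failed: 403

resp.status_code = 200
print(check(resp))  # ok
```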

P.S.: please make sure to check whether web scraping Amazon is legal.
