
Web scraping using requests module gets stuck without cutting kernel python

I am scraping content from a list of URLs and printing the text in Python using the bs4 and requests modules. The problem is that the scraping always gets stuck on a random URL without ever stopping.

Furthermore, when I manually interrupt with Ctrl+C, it doesn't stop properly: I can't seem to run any other code afterwards, as if something is still going on in the background.

Before I scrape, I check that the response code is 200. The code looks like this (this is the URL it seems to have got stuck on this time):

import requests
from bs4 import BeautifulSoup

url = 'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon'

response = requests.get(url)
if str(response) == '<Response [200]>':
    report = BeautifulSoup(response.content, 'lxml').text
print(report)

Does the requests module have a limit on how many times you can use it within an hour? Would anyone know how I could start debugging a problem like this when there is no error at all?

If any further clarification or code is needed, please let me know.
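One thing I am considering, in case the request itself is what hangs, is passing an explicit timeout so that a stalled connection raises an exception instead of blocking forever. This is only a sketch of what I might try, with arbitrary timeout values:

import requests
from bs4 import BeautifulSoup

url = 'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon'

try:
    # timeout=(connect, read) in seconds: a request that stalls past these
    # limits raises a requests exception instead of hanging indefinitely
    response = requests.get(url, timeout=(10, 30))
except requests.exceptions.RequestException as exc:
    print(f'Request failed for {url}: {exc}')
else:
    if str(response) == '<Response [200]>':
        report = BeautifulSoup(response.content, 'lxml').text
        print(report)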

EDIT

This has happened again with a different URL. The response code was <Response [403]> for url = https://www.investing.com/news/stephens-stick-to-their-buy-rating-for-tyson-foods-2470535?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon

It has also got stuck on this one:

<Response [200]> https://www.benzinga.com/analyst-ratings/analyst-color/21/04/20568999/analysts-upgraded-amc-snap-united-airlines-and-tesla-in-the-past-week

Once again it won't let me interrupt and continue working. For the 403 error, the code should just skip past the condition anyway, so I really don't understand: there is no error, it just keeps running.

I ran it about 100 times without any exceptions.

One note, though: this is not the Pythonic way to check the response status code:

if str(response) == '<Response [200]>':

Use this code instead:

if response.status_code == 200:
    # do stuff with the response here
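As a rough illustration (the loop and the URL list are just my assumptions based on the question, not tested against your full list), the check could slot in like this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon',
    'https://www.investing.com/news/stephens-stick-to-their-buy-rating-for-tyson-foods-2470535?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon',
]

for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        report = BeautifulSoup(response.content, 'lxml').text
        print(report)
    else:
        # e.g. the 403 from investing.com is simply reported and skipped
        print(f'Skipping {url}: status {response.status_code}')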
