
Web scraping using requests module gets stuck without cutting kernel python

I am scraping content from a list of URLs and printing the text in Python using the bs4 and requests modules. The problem is that the scraping always gets stuck on a random URL without ever stopping.

Furthermore, when I manually interrupt with Ctrl+C, it doesn't stop properly: I can't seem to run any other code afterwards, as if something is still going on in the background.

Before I scrape, I check that the response code is 200. The code looks like this (this is the URL it seems to have got stuck on this time):

import requests
from bs4 import BeautifulSoup

url = 'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon'

response = requests.get(url)
if str(response) == '<Response [200]>':
    report = BeautifulSoup(response.content, 'lxml').text
print(report)

Does the requests module have a limit on how many times you can use it within an hour? Would anyone know how I could start debugging a problem like this when there is no error at all?

If any further clarification or code is needed, please let me know.
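One thing I am considering, in case the request itself is what hangs, is passing an explicit timeout so that a stalled connection raises an exception instead of blocking forever. This is only a sketch of what I might try, with arbitrary timeout values:

import requests
from bs4 import BeautifulSoup

url = 'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon'

try:
    # timeout=(connect, read) in seconds: a request that stalls past these
    # limits raises a requests exception instead of hanging indefinitely
    response = requests.get(url, timeout=(10, 30))
except requests.exceptions.RequestException as exc:
    print(f'Request failed for {url}: {exc}')
else:
    if str(response) == '<Response [200]>':
        report = BeautifulSoup(response.content, 'lxml').text
        print(report)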

EDIT

This has happened again with a different URL. The response code was <Response [403]> for url = https://www.investing.com/news/stephens-stick-to-their-buy-rating-for-tyson-foods-2470535?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon

It has also got stuck on this one:

<Response [200]> https://www.benzinga.com/analyst-ratings/analyst-color/21/04/20568999/analysts-upgraded-amc-snap-united-airlines-and-tesla-in-the-past-week

Once again it won't let me interrupt and continue working. For the 403 error, the code should just skip past the condition anyway, so I really don't understand: there is no error, it just keeps running.

I ran it about 100 times without any exceptions.

One note, though: this is not the Pythonic way to check the response status code:

if str(response) == '<Response [200]>':

Use this code instead:

if response.status_code == 200:
    # do stuff with the response here
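As a rough illustration (the loop and the URL list are just my assumptions based on the question, not tested against your full list), the check could slot in like this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.businessinsider.in/business/ecommerce/news/amazon-is-eyeing-india-startups-as-it-gears-up-for-a-fight-with-asia-richest-man-in-retail/articleshow/81773692.cms?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon',
    'https://www.investing.com/news/stephens-stick-to-their-buy-rating-for-tyson-foods-2470535?utm_campaign=cityfalcon&utm_medium=cityfalcon&utm_source=cityfalcon',
]

for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        report = BeautifulSoup(response.content, 'lxml').text
        print(report)
    else:
        # e.g. the 403 from investing.com is simply reported and skipped
        print(f'Skipping {url}: status {response.status_code}')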
