
MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time figuring out how to web scrape a site when it requires making multiple requests to the same server. I have to scrape 3000 products from one website, which implies making several requests (for example searching for the product, clicking on it, going back to the home page) 3000 times. I should state that I am using Selenium.

If I launch only one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on the webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out this happens because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 seconds), but when I do so I get the MaxRetryError because of too many requests to that URL using the same session. I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I can only make one more search before the exception is thrown again, and so on).

I am wondering if there is any workaround for this kind of issue. It might be using another library, a way of changing IP or session dynamically, or something like that. PS: I would rather keep working with Selenium if possible.

This error is normally raised when the server detects a high request rate from your client.
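Before reaching for heavier tooling, it can be enough to throttle your own request rate. A minimal sketch (the `polite_get` helper and the delay values are illustrative, not part of the original answer):

```python
import random
import time

def polite_get(browser, url, min_delay=2.0, max_delay=5.0):
    """Fetch a URL after a randomized pause to keep the request rate low."""
    # a randomized delay looks less bot-like than a fixed interval
    time.sleep(random.uniform(min_delay, max_delay))
    browser.get(url)
```

Calling `polite_get(browser, url)` instead of `browser.get(url)` spaces the 3000 searches out, which may already keep you under the server's rate threshold.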

As you mentioned, the server bans your IP from making further requests, so you can get around that with some available technologies. Look into Zalenium, and also see here for some other possible ways.

Another possible (but tedious) way is to use a number of browser instances to make the calls; for example, an answer from here illustrates that:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

urlArr = ['https://link1', 'https://link2', '...']

for url in urlArr:
   chrome_options = Options()
   # in Selenium 4, the driver path is passed via Service rather than
   # the deprecated executable_path keyword
   service = Service('C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe')
   # a fresh browser per URL keeps memory usage flat across the run
   with webdriver.Chrome(service=service, options=chrome_options) as browser:
      browser.get(url)
      # your task
   # the with block calls browser.quit() on exit, closing all of this
   # instance's windows, so no explicit close()/quit() calls are needed
