Does the browser instance from mechanize cache?
I am doing some web scraping with the mechanize browser, using the following code. I realized that in some cases we keep getting the same page even though the remote page has already changed. So my question is: does the browser instance cache pages?
If so, how can we change that, or is there a way to avoid caching (apart from creating a new browser instance on every iteration of the scraping loop)?
# put in login details and submit; returns a mechanize.Browser instance
browser = _login()

# main loop
while True:
    rsp = browser.open(URL)
    html = rsp.read()
Thanks.
According to this thread,
Mechanize instances do cache pages you've visited, but you can clear that with agent.history.clear; or prevent history from being saved by setting agent.history.max_size = 0. Or, you can use a new Mechanize instance altogether.
In particular,
Currently Mechanize reuses pages in the history of the session if a request with an If-Modified-Since header results in 304 Not Modified.
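To make the quoted behavior concrete, here is a toy model in plain Python. It does not use mechanize and is not how the library is implemented; `Response`, `HistoryCache`, and `make_fake_server` are hypothetical names made up for this sketch. It assumes a server that keeps answering 304 Not Modified because its Last-Modified stamp was not updated when the content changed, which is one way a client that reuses history could keep returning the same page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical, not mechanize's class)."""
    status: int
    body: bytes = b""
    last_modified: Optional[str] = None


class HistoryCache:
    """Toy client that reuses a remembered page when the server replies 304."""

    def __init__(self, fetch: Callable[[str, Dict[str, str]], Response]) -> None:
        self._fetch = fetch                      # fetch(url, headers) -> Response
        self._pages: Dict[str, Response] = {}    # url -> last full response

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        cached = self._pages.get(url)
        if cached is not None and cached.last_modified:
            # Conditional request: ask only for content newer than our copy.
            headers["If-Modified-Since"] = cached.last_modified
        rsp = self._fetch(url, headers)
        if rsp.status == 304 and cached is not None:
            return cached.body                   # reuse the remembered page
        self._pages[url] = rsp
        return rsp.body

    def clear(self) -> None:
        """Forget every remembered page, so the next request is unconditional."""
        self._pages.clear()


def make_fake_server():
    """Fake server whose Last-Modified stamp never changes (an assumption of this sketch)."""
    state = {"body": b"version 1", "last_modified": "Mon, 01 Jan 2024 00:00:00 GMT"}

    def fetch(url: str, headers: Dict[str, str]) -> Response:
        if headers.get("If-Modified-Since") == state["last_modified"]:
            return Response(status=304)
        return Response(200, state["body"], state["last_modified"])

    return state, fetch


if __name__ == "__main__":
    state, fetch = make_fake_server()
    client = HistoryCache(fetch)
    url = "http://example.invalid/page"
    print(client.get(url))       # first fetch, full response
    state["body"] = b"version 2" # remote page changes, stamp does not
    print(client.get(url))       # server says 304, old page is reused
    client.clear()
    print(client.get(url))       # unconditional request sees the new page
```

In this model, clearing the remembered pages plays the same role as clearing the history or starting with a fresh instance in the quote above: the next request carries no If-Modified-Since header, so the server has to send the full page.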
And according to the documentation here, in Python the following code will prevent the caching-like behavior (seekable responses):
import mechanize
ua = mechanize.UserAgent()
ua.set_seekable_responses(False)
ua.set_handle_equiv(False)
ua.set_debug_responses(False)
Hope that provides some insight.