
How can I load all of a site's resources, including AJAX requests, etc., in Python?

I know how to request a web site and read its text with Python. In the past, I've tried using a library like BeautifulSoup to request every link on a site, but that misses anything that doesn't look like a full URL, such as AJAX requests and most requests to the original domain (the "http://example.com" prefix will be missing and, more importantly, the URL isn't in an <a href='url'>Link</a> format, so BeautifulSoup will miss it).

How can I load all of a site's resources in Python? Will it require interacting with something like Selenium, or is there a way to do it without Selenium that's not too difficult to implement? I haven't used Selenium much, so I'm not sure how difficult that will be.

Thanks

It all depends on what you want and how you want it. The closest thing that may work for you is Ghost.py:

from ghost import Ghost

ghost = Ghost()
# open() returns the main page plus the extra resources loaded along with it
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content

You can find out more at: http://jeanphix.me/Ghost.py/

I would love to hear other ways of doing this, especially if they're more concise (easier to remember), but I think this accomplishes my goal. It does not fully answer my original question, though. It just gets more of the content than requests.get(url) does, which was enough for me in this case:

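import urllib2  # Python 2 only; in Python 3 this functionality lives in urllib.request

url = 'http://example.com'
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
ch = sock.read()  # raw body of the response
sock.close()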
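For readers on Python 3, where urllib2 no longer exists, a rough equivalent of the snippet above might look like the sketch below. The helper names build_request and fetch are made up for illustration and are not part of any library. Like the original, this only retrieves the single document at the URL and does not execute scripts or follow AJAX calls.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Open the URL and return the raw response body as bytes."""
    # The context manager closes the connection, replacing sock.close().
    with urllib.request.urlopen(build_request(url), timeout=timeout) as sock:
        return sock.read()


# Example call (performs a real network request):
#     ch = fetch("http://example.com")
```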

Hmm, that's a pretty interesting question. For resources whose URLs can't be fully identified up front because they are generated at runtime (such as those built inside scripts, not only AJAX calls), you'd need to actually run the website, so that the scripts get executed and the dynamic URLs get created.

One option is something like what this answer describes: using a third-party library, such as Qt, to actually run the website. To collect all the URLs, you need some way of monitoring every request the website makes, which could be done like this (it's C++, but the code is essentially the same).

Finally, once you have the URLs, you can use something like Requests to download the external resources.
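As a rough sketch of that last step, assuming the URLs have already been collected by whatever monitoring approach you use: the code below resolves relative and protocol-relative URLs against the page address, drops duplicates and non-HTTP schemes, and downloads each one. The function names resolve_urls, fetch and download_all are hypothetical, and the standard library's urllib.request is swapped in for Requests here only to keep the example dependency-free.

```python
import urllib.request
from urllib.parse import urldefrag, urljoin


def resolve_urls(page_url, raw_urls):
    """Turn collected URLs into absolute ones, dropping fragments and duplicates."""
    resolved = []
    seen = set()
    for raw in raw_urls:
        # urljoin handles "/path", "relative/path" and "//host/path" forms.
        absolute, _fragment = urldefrag(urljoin(page_url, raw.strip()))
        # Skip things like "javascript:" or "mailto:" that cannot be downloaded.
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


def fetch(url, timeout=10):
    """Download one resource; a stdlib stand-in for requests.get(url).content."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def download_all(page_url, raw_urls, fetch_one=fetch):
    """Map each absolute URL to its body, or to the error raised while fetching."""
    results = {}
    for url in resolve_urls(page_url, raw_urls):
        try:
            results[url] = fetch_one(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            results[url] = exc
    return results
```

Passing a different fetch_one (for example, a small wrapper around requests.get) lets you change the HTTP client without touching the URL handling.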


