
Can't scrape non-HTML elements

I'm trying to scrape search results from a number of websites. The problem is that not all of these sites return their search results as plain HTML; a lot of it is generated dynamically with JS, AJAX, etc. However, I can see exactly what I need by looking at the page with the Firefox inspector, since the scripts have all run and modified the HTML.

My question is: is there a way for me to download a webpage AFTER allowing the scripts to run, or at least to get them to run locally? That way, I'd get the final HTML.

For reference, I'm using Python.

Possible duplicate. In that case the question is about PHP and JS.

Sure. You have to provide an environment for the scripts (JS) to run in, and often return a test value to the target server. That's not easy to do from server-side languages, so today we mostly rely on the browser-driving or browser-imitating tools mentioned there.
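As a rough illustration of the browser-driving route, here is a minimal sketch that assumes Selenium with Firefox (my choice for the example, not something named in the linked question). The function name `fetch_rendered_html` and the `driver_factory` parameter are hypothetical; the factory is injectable only so the function can be tried without a real browser.

```python
def fetch_rendered_html(url, driver_factory=None):
    """Load `url` in a browser and return the DOM serialized after scripts ran."""
    if driver_factory is None:
        # Assumption: Selenium and a Firefox driver are installed.
        # Imported lazily so the function can be exercised without them.
        from selenium import webdriver
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)            # the browser loads the page and runs its scripts
        return driver.page_source  # current DOM as HTML, not the raw response body
    finally:
        driver.quit()              # always close the browser, even on errors
```

Note that `page_source` reflects the DOM at the moment it is read, so results that arrive through a later AJAX call may need an explicit wait before reading it.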

I've found the Python analog of the v8js PHP plugin for you: PyV8.

PyV8 is a Python wrapper for the Google V8 engine; it acts as a bridge between Python and JavaScript objects and supports hosting Google's V8 engine in a Python script.

If properly configured, your scraper:

  1. Gets the site's JS.
  2. Evaluates this JS through the given plugin.
  3. Gets access to the target HTML for further parsing.
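The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: `ScriptCollector` and `evaluate_page_scripts` are hypothetical names, only inline scripts are evaluated, and the JS engine is passed in as a plain callable so that the PyV8 wiring (shown in a comment, untested here) stays optional. Keep in mind that a bare V8 engine has no `document` or `window`, so scripts that rewrite the page still need a DOM environment supplied to them.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect inline <script> bodies and external script URLs from a page."""

    def __init__(self):
        super().__init__()
        self.inline_sources = []   # text of each inline <script> block
        self.external_urls = []    # src attribute of each external script
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external_urls.append(src)
        else:
            self._in_inline_script = True
            self.inline_sources.append("")

    def handle_data(self, data):
        if self._in_inline_script:
            self.inline_sources[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False


def evaluate_page_scripts(html, evaluate):
    """Step 1: pull the page's inline JS. Step 2: hand each script to `evaluate`."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [evaluate(source) for source in collector.inline_sources if source.strip()]


# Assumed wiring with PyV8 (not run here; PyV8 must be installed separately):
#   import PyV8
#   ctxt = PyV8.JSContext()
#   ctxt.enter()
#   results = evaluate_page_scripts(html, ctxt.eval)
```

External scripts listed in `external_urls` would have to be downloaded and evaluated as well, in page order, before step 3 can yield the final HTML.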
