How to scrape websites from within a browser?

I would like to scrape a website by just running code in a browser. In this case, the scraper has to run on a specific machine, and I cannot install any software on that machine. However, there is already a browser installed (a recent version of Firefox), and I can configure the browser however I want.

What I would like is a JavaScript solution for scraping, contained in a webpage on site A, that can scrape site B. It seems like this would run into some CORS-type problems; I assume that part of the solution is to disable any cross-origin checks in the browser.

What I have tried so far: I looked up "web scraping in javascript". This brings up a lot of material intended to run in Node.js with cheerio (for example, this tutorial), as well as tools like pjscrape, which requires PhantomJS. However, I couldn't find anything equivalent that is intended to run in a browser.
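For illustration, the browser-side counterpart to the cheerio approach might look roughly like the sketch below. It uses only the standard fetch, DOMParser and URL APIs. It assumes the cross-origin request to site B is actually allowed (site B sends CORS headers, or the browser's cross-origin checks have been disabled); the function names and the example URL are made up for this sketch.

```javascript
// Sketch only: assumes site B permits the cross-origin request.

// Collect absolute link URLs from a parsed document.
// `doc` is anything exposing querySelectorAll (e.g. a DOMParser result).
function extractLinks(doc, baseUrl) {
    var links = [];
    doc.querySelectorAll("a[href]").forEach(function (a) {
        // Resolve relative hrefs against the scraped page's own URL.
        links.push(new URL(a.getAttribute("href"), baseUrl).href);
    });
    return links;
}

// Fetch a page from site B and parse it into a detached document.
function scrape(url) {
    return fetch(url)
        .then(function (response) {
            if (!response.ok) {
                throw new Error("HTTP " + response.status + " for " + url);
            }
            return response.text();
        })
        .then(function (html) {
            // The parsed document is inert: its scripts are not executed.
            var doc = new DOMParser().parseFromString(html, "text/html");
            return extractLinks(doc, url);
        });
}

// Usage (hypothetical URL):
// scrape("https://site-b.example/page").then(console.log).catch(console.error);
```

Because the parsed document never runs site B's scripts, this only sees the static HTML; content that site B builds with JavaScript would not show up.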

PS This is interesting: "Firefox setting to enable cross domain ajax request". Apparently Chrome's --disable-web-security flag takes care of the cross-origin/cross-domain issues. Is there a Firefox equivalent?

PS It looks like the ForceCORS extension for Firefox is also useful ( http://www-jo.se/f.pfleger/forcecors ). I'm not sure if I'll be able to install it, though.

PS This is a nice collection of ways to allow cross-domain requests in different browsers ( http://romkey.com/2011/04/23/getting-around-same-origin-policy-in-web-browsers/ ). Sadly, the suggested Firefox solution doesn't work in versions >=5.

Edit: it looks like the import.io service has shut down and the URL now points to something completely different. Consider this answer obsolete.

Try doing it with import.io (basically a scraping service with a REST API).

As soon as I have an example JavaScript call to the API, I can provide it. Or you can check the docs yourself.

Import.io allows you to structure the data you find on webpages into rows and columns, using simple point and click technology.

First you locate your data: navigate to a website using our browser (download it from us here: http://import.io ).

Then, enter our dedicated data extraction workflow by clicking the pink IO button in the top right of the Browser.

We will guide you through structuring the data on the page. You teach import.io how to extract the data by showing us examples of where the data is. We create learning algorithms that generalize from these examples to work out how to get all the data on the website. The data you collect is stored on our cloud servers to be downloaded and shared. And every time you publish to our platform we create an API to get the data programmatically so you can easily integrate live web data into your applications or third party analytics and visualization software.

EDIT:

If the data recognition works in the browser, you can access the data by going to "simple API integration" and copying the URL.

(Image: exporting data in import.io)

You can paste the URL here:

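// Log the parsed JSON once the response has loaded.
// (A value returned from an event listener is discarded, so use the data here.)
function reqListener() {
    var data = JSON.parse(this.responseText);
    console.log(data);
}

var oReq = new XMLHttpRequest();
oReq.addEventListener("load", reqListener);
oReq.addEventListener("error", function () {
    console.error("Request failed (network or cross-origin error)");
});
// Replace the placeholder with the URL copied from import.io.
oReq.open("GET", "yourUrlFromClipboardComesHere", true);
oReq.send();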

XHR request source
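If you need the parsed data outside the callback, a Promise-based variant may be easier to work with. This is only a sketch using the standard fetch API; the fetchJson name and its optional fetchFn parameter (there so the function can be tried without a network) are my own additions, not part of any import.io API.

```javascript
// Sketch: Promise-based equivalent of the XHR call.
// `fetchFn` is optional and defaults to the browser's global fetch.
function fetchJson(url, fetchFn) {
    var doFetch = fetchFn || fetch;
    return doFetch(url).then(function (response) {
        if (!response.ok) {
            throw new Error("Request failed with HTTP " + response.status);
        }
        // response.json() parses the body and resolves with the data.
        return response.json();
    });
}

// Usage: put the copied URL in place of the placeholder.
// fetchJson("yourUrlFromClipboardComesHere")
//     .then(function (data) { console.log(data); })
//     .catch(function (err) { console.error(err); });
```

Unlike the return statement inside an event listener, the value resolved here reaches whoever calls .then() on the result.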
