
Looking for a way to scrape HTML with JS

As the title suggests, I'm looking for a reasonably straightforward way to scrape all of the HTML from a web page, perhaps storing it in a string and then navigating through that string to pull out the desired element.

Specifically, I want to scrape my Twitter page and display my profile picture inside a new div. I know there are several tools for doing just this, but would anyone have some code examples or suggestions for how I might do this myself?

Thanks a lot

UPDATE

After a very helpful response from TJ Crowder, I did some more searching online and found this resource.

In theory, this is easy. You just do an ajax call to get the text of the page, then use jQuery to turn that into a disconnected DOM, and then use all the usual jQuery tools to find and extract what you need.

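$.ajax({
    url:      "http://example.com/some/path",
    dataType: "html", // treat the response as plain HTML text
    success:  function(html) {
        // Parse into a disconnected DOM. Wrapping the nodes in a <div> lets
        // .find() also match elements that are top-level in the response.
        var tree = $("<div>").append($.parseHTML(html));
        var imgsrc = tree.find("img.some-class").attr("src");
        if (imgsrc) {
            // ...add the image to your page
        }
    }
});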

But (and it's a big one) it's not likely to work, because of the Same Origin Policy, which prevents script on your page from reading the response to a cross-origin ajax call. Certain individual sites may have an open CORS policy, but most won't, and of course supporting CORS on IE8 and IE9 requires an extra jQuery plug-in.
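To make "cross-origin" concrete, here is a minimal sketch of the comparison the browser effectively makes: two URLs share an origin only when scheme, host, and port all match. The helper name `isSameOrigin` and the page URL are made up for illustration; the standard `URL` class does the parsing.

```javascript
// Returns true when targetUrl has the same origin (scheme + host + port)
// as pageUrl. Relative targets are resolved against pageUrl first.
function isSameOrigin(targetUrl, pageUrl) {
    const page = new URL(pageUrl);
    const target = new URL(targetUrl, page);
    return target.origin === page.origin;
}

// Hypothetical page location, used only for illustration.
const examplePage = "http://example.com/index.html";

console.log(isSameOrigin("/some/path", examplePage));                   // true
console.log(isSameOrigin("http://example.com:8080/x", examplePage));    // false (port differs)
console.log(isSameOrigin("https://example.com/x", examplePage));        // false (scheme differs)
console.log(isSameOrigin("https://twitter.com/someuser", examplePage)); // false (host differs)
```

Any request for which this check comes out false is the kind that needs either CORS permission from the other site or a server in the middle.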

So to do this with sites that don't allow your origin via CORS, there must be a server involved. It can be your server and you can grab the text of the page you want using server-side code and then send it to your page via ajax (or just build the bits you want into your page when you first render it). All of the usual server-side stacks (PHP, Node, ASP.Net, JVM, ...) have the ability to grab web pages. Or, in some cases, you may be able to use YQL as a cross-domain proxy, using their server rather than your own.
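As a rough idea of the "your own server" option, here is a minimal Node sketch, assuming Node 18 or later for the global `fetch`. The `/fetch?url=...` route, the `ALLOWED_HOSTS` allowlist, and the helper names are all invented for this example, not part of any library. The allowlist is there so the endpoint can't be used to fetch arbitrary sites.

```javascript
const http = require("node:http");

// Hypothetical allowlist: only these hosts may be fetched through the proxy,
// so the endpoint cannot be abused as an open relay.
const ALLOWED_HOSTS = new Set(["example.com"]);

// Returns a parsed URL if `raw` is an http(s) URL on an allowed host, else null.
function parseAllowedTarget(raw) {
    let target;
    try {
        target = new URL(raw);
    } catch (err) {
        return null; // not a valid absolute URL
    }
    const okProtocol = target.protocol === "http:" || target.protocol === "https:";
    return okProtocol && ALLOWED_HOSTS.has(target.hostname) ? target : null;
}

// Builds (but does not start) a server answering GET /fetch?url=<page>.
function createProxyServer() {
    return http.createServer(async (req, res) => {
        const requestUrl = new URL(req.url, "http://localhost");
        const target = requestUrl.pathname === "/fetch"
            ? parseAllowedTarget(requestUrl.searchParams.get("url") || "")
            : null;
        if (!target) {
            res.writeHead(400, { "Content-Type": "text/plain" });
            res.end("Bad or disallowed url parameter");
            return;
        }
        try {
            const upstream = await fetch(target); // global fetch, Node 18+
            const html = await upstream.text();
            res.writeHead(upstream.status, { "Content-Type": "text/html; charset=utf-8" });
            res.end(html);
        } catch (err) {
            res.writeHead(502, { "Content-Type": "text/plain" });
            res.end("Upstream fetch failed");
        }
    });
}
```

Starting it with `createProxyServer().listen(3000)` and serving your page from that same server means the page can request `"/fetch?url=" + encodeURIComponent("http://example.com/some/path")` as a same-origin ajax call, then parse the returned text on the client exactly as in the jQuery snippet.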
