How to perform a background load and scraping of a page with XUL/Firefox Extension

I want to scrape the user pages of SO to give the users of my toolbar updated information on their questions, answers, etc.

This means I need to do this in the background: parse the pages, extract the content, compare it with the last run, and then present the results either on the toolbar or the status bar or, alternatively, in a pop-up window of some kind. All of this has to be done while the user is going about his business, without being interrupted and without even being on SO.
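The "compare it with the last run" step could look roughly like this. This is only a sketch: it assumes the scraped values are kept as a plain object of counters, and the function name `diffSnapshots` and the field names are hypothetical, not something the question specifies.

```javascript
// Hypothetical snapshot shape: { reputation: 1200, answers: 34, questions: 5 }
// Returns an object describing only the fields whose value changed.
function diffSnapshots(previous, current) {
  const changes = {};
  for (const key of Object.keys(current)) {
    // On the very first run there is no previous snapshot at all.
    const before = previous ? previous[key] : undefined;
    if (before !== current[key]) {
      changes[key] = { before: before, after: current[key] };
    }
  }
  return changes;
}
```

An empty result means nothing changed since the last run, so the toolbar or status bar can be left alone.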

I've searched quite thoroughly both on Google and on the Mozilla Wiki for some kind of hint. I've even gone to the extent of downloading a few other extensions that I think do the same thing. Unfortunately I've not had the time to go through all of them, and the ones I've looked at all use data APIs (services, web services, XML), not HTML scraping.

Old question text

I'm looking for a nice place to learn how I can load a page inside a function called by the infamous setTimeout() to do screen-scraping in the background.

My idea is to present the results of such scraping in a status bar extension, in case anything has changed since the last run.

Is there a hidden overlay or some other subterfuge?

In the case of XUL/Firefox, what you need is the nsIIOService interface, which you can get like this:

var mIOS = Components.classes["@mozilla.org/network/io-service;1"].
   getService(Components.interfaces.nsIIOService);

Then you need to create a channel, and open an asynchronous link:

var channel = mIOS.newChannel(urlToOpen, 0, null);
channel.asyncOpen(new StreamListener(), channel);

The key here is the StreamListener object:

var StreamListener = function() {
    return {
        // Report which XPCOM interfaces this object implements.
        QueryInterface: function(aIID) {
            if (aIID.equals(Components.interfaces.nsIStreamListener) ||
                aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
                aIID.equals(Components.interfaces.nsISupports))
                return this;
            throw Components.results.NS_NOINTERFACE;
        },

        // Called once when the response starts arriving.
        onStartRequest: function(aRequest, aContext)
           { return 0; },

        // Called once when the request finishes; aStatusCode reports success or failure.
        onStopRequest: function(aRequest, aChannel /* aContext */, aStatusCode)
           { return 0; },

        // Called for each chunk of data; aCount bytes can be read from aStream.
        onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount)
           { return 0; }
    };
};

You have to fill in the details in the onStartRequest, onStopRequest and onDataAvailable functions, but that should be enough to get you going. You can have a look at how I used this interface in my Firefox extension (it is called IdentFavIcon, and it can be found on the Mozilla add-ons site).
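One way to fill in those three callbacks is to accumulate every chunk into a single string and hand it over when the request stops. The sketch below is not the IdentFavIcon code; `makeCollectingListener`, `readChunk` and `onDone` are hypothetical names. The actual stream reading is passed in as `readChunk(stream, count)`, on the assumption that inside an extension it would wrap an nsIScriptableInputStream (`init(stream)` followed by `read(count)`).

```javascript
// Builds a listener object with the three stream callbacks.
// readChunk(stream, count) must return `count` bytes of `stream` as a string;
// onDone(statusCode, data) receives the complete body at the end.
function makeCollectingListener(readChunk, onDone) {
  let data = "";
  return {
    onStartRequest: function (aRequest, aContext) {
      data = ""; // start with an empty buffer for every request
    },
    onDataAvailable: function (aRequest, aContext, aStream, aOffset, aCount) {
      data += readChunk(aStream, aCount); // append the newly arrived chunk
    },
    onStopRequest: function (aRequest, aContext, aStatusCode) {
      onDone(aStatusCode, data); // the whole page is now in one variable
    }
  };
}
```

In the extension, the QueryInterface method from the StreamListener above would still have to be added to the returned object before passing it to asyncOpen.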

The part I'm uncertain about is how you can trigger this page request from time to time; setTimeout() should probably work, though.
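A rough sketch of such a periodic trigger, assuming the code runs somewhere a `setTimeout`-style function is available (for example a browser window overlay). `startPolling` and its `schedule` parameter are hypothetical names; `schedule` exists only so that a different timer function can be swapped in.

```javascript
// Runs `task` every `intervalMs` milliseconds until the returned stop() is called.
// `schedule` is optional and defaults to the global setTimeout.
function startPolling(task, intervalMs, schedule) {
  const scheduleFn = schedule || setTimeout;
  let stopped = false;

  function tick() {
    if (stopped) return;
    try {
      task(); // e.g. open the channel and start the background load
    } catch (e) {
      // A failed run should not cancel the following ones.
    }
    scheduleFn(tick, intervalMs); // re-arm the timer for the next run
  }

  scheduleFn(tick, intervalMs);
  return function stop() { stopped = true; };
}
```

Re-arming a one-shot timer after each run, rather than using a fixed repeating timer, keeps runs from piling up if one of them takes longer than expected.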

Edit:

  1. See example here (see section Downloading Images) for an example on how to collect downloaded data into a single variable; and
  2. See this page on how to convert an HTML source into a DOM tree.

HTH.

I am not sure if I understood the question completely, but I will try to answer a few apparent alternative questions:

If you are looking for static web page scraping, BeautifulSoup (Python) is one of the best and easiest options.

If you are looking for changes in an Ajax-based page that changes over time, you will have to keep running the code in a loop. But do not poll the site too frequently: it may detect the bandwidth consumption and block your IP, so poll at a reasonable interval.
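One possible way to keep the polling gentle is to lengthen the interval whenever nothing has changed. This is just an illustration: the doubling strategy, the function name `nextPollDelay` and the limits are all assumptions, not something the site requires.

```javascript
// Returns the delay to wait before the next poll.
// Resets to the minimum when the page changed, otherwise doubles up to a cap.
function nextPollDelay(currentDelayMs, contentChanged, minDelayMs, maxDelayMs) {
  if (contentChanged) {
    return minDelayMs;
  }
  return Math.min(currentDelayMs * 2, maxDelayMs);
}
```

With a minimum of one minute and a maximum of one hour, an idle profile would be checked after 1, 2, 4, 8, ... minutes and then at most once an hour.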

If you are looking to scrape some JavaScript-rendered tickers or similar content, that cannot be done until the page is rendered, hence it is not possible with BeautifulSoup alone. You will have to use a headless browser like Crowbar - SIMILE (uses XULRunner), which renders the JavaScript content, and the output of this rendered content can be used as input to the BeautifulSoup scraper.

From privileged JavaScript, i.e. JS in an extension, you are allowed to create hidden iframes; downloading the specified page is as simple as setting the location on this frame.

If you're pulling down a simple, static page that you own, setTimeout should be fine. But in that case, why not use XHR?

If you're pulling down arbitrary pages, ones with dynamic elements or lots of content, I'd recommend triggering your scrape of the page using Document.onload event handlers instead. It's way more reliable, and you can get clever about scraping the page at the earliest possible moment, as soon as you know the required content is there.

I don't think there's a specific tutorial on this, but the Mozilla Developer Center, which I'm sure you've already found, is absolutely excellent: the best online technical documentation, in my opinion!

Have a look at XMLHttpRequest; that should get you started.
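For what it's worth, a background load with XMLHttpRequest might look something like the sketch below. `fetchPage` and its callback parameters are hypothetical names, and the optional `XhrCtor` argument is there only so that another constructor can be substituted where the global XMLHttpRequest is not available.

```javascript
// Loads `url` asynchronously and passes the page source to onSuccess,
// or an Error to onError. XhrCtor is optional and defaults to XMLHttpRequest.
function fetchPage(url, onSuccess, onError, XhrCtor) {
  const Ctor = XhrCtor || XMLHttpRequest;
  const xhr = new Ctor();
  xhr.open("GET", url, true); // true = asynchronous, so the UI is not blocked
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return; // 4 = request finished
    if (xhr.status === 200) {
      onSuccess(xhr.responseText);
    } else {
      onError(new Error("HTTP status " + xhr.status));
    }
  };
  xhr.send(null);
  return xhr;
}
```

The string handed to onSuccess is the raw HTML source, which still has to be parsed before anything can be extracted from it.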

