
A media spider in Node.js

I'm working on a project named robot, hosted on GitHub. The job of my project is to fetch media from the URLs given in an XML config file, and the XML config file has a defined format, as you can see in the scripts dir.

My problem is as follows. There are two arguments:

  1. A list which indicates how deep the web links go; according to the selector (CSS selector) in each list item, I can find either the media URL or the URL of a sub-page where I may finally find the media.
  2. An array which contains the sub-page URLs.

A simplified example is below:

node_list = { ..., next: { ..., next: null } };
url_arr = [urls];
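
For concreteness, a hypothetical instance of those two structures might look like this (the selector values and URLs are made up for illustration):

// a linked list describing how deep to follow links, one selector per level
var node_list = {
    selector: 'a.album',        // level 1: matches links to sub-pages
    next: {
        selector: 'img.photo',  // level 2: matches the media itself
        next: null
    }
};

// the sub-page urls to start from
var url_arr = ['http://example.com/page1', 'http://example.com/page2'];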

I want to iterate over all the items in the URL array, so I do the following:

var http = require('http');

function fetch(url, node) {
    if (node == null)
        return;
    // here do something with the http request
    var req = http.get(url, function(res) {
        var data = '';
        res.on('data', function(chunk) {
            data += chunk;
        }).on('end', function() {
            // maybe here generate more new urls
            // get another url_list
            var url_new = '...'; // extracted from data via the selector in node
            node = node.next;
            fetch(url_new, node);
        });
    });
}

// these calls need to run sequentially (in sync)
for (var i = 0; i < url_arr.length; i++) {
    fetch(url_arr[i], node);
}

As you can see, if I use async HTTP requests this way, they eat up all the system resources, and I cannot control the process. So does anyone have a good idea to solve this problem? Or is Node.js not the proper way to do such jobs?

If the problem is that you make too many HTTP requests simultaneously, you could change the fetch function to operate on a stack of URLs.

Basically you would do this:

  • When fetch is called, push the URL onto the stack and check whether a request is in progress:
  • If no request is running, pick the first URL from the stack and process it; otherwise do nothing.
  • When an HTTP request finishes, have it take the next URL from the stack and process that.

This way you can have the for-loop add all the URLs like now, but only one URL is processed at a time, so there won't be too many resources in use.
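
A minimal sketch of that idea, reusing the fetch(url, node) signature from the question; the stack array, busy flag, and processNext helper are illustrative names I'm introducing, not part of the original code:

var http = require('http');

var stack = [];   // pending { url, node } pairs
var busy = false; // is a request currently in flight?

// instead of firing a request immediately, queue it
function fetch(url, node) {
    if (node == null) return;
    stack.push({ url: url, node: node });
    if (!busy) processNext();
}

function processNext() {
    if (stack.length === 0) {
        busy = false;
        return;
    }
    busy = true;
    var item = stack.shift(); // take the first queued url
    http.get(item.url, function(res) {
        var data = '';
        res.on('data', function(chunk) {
            data += chunk;
        }).on('end', function() {
            // parse data here; any newly discovered urls are queued
            // via fetch(url_new, item.node.next) and wait their turn
            processNext(); // only now start the next request
        });
    }).on('error', function() {
        processNext(); // skip failed urls and keep going
    });
}

If one request at a time turns out to be too slow, the same pattern generalizes: replace the boolean flag with a counter and allow up to N requests in flight before queuing.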
