A media spider in Node.js
I'm working on a project named robot, hosted on GitHub. Its job is to fetch media from URLs given in an XML config file, and the config file follows the defined format you can see in the scripts dir.
My problem is as follows. There are two arguments, shown here in simplified form:
node_list = { /* ... */, next: { /* ... */, next: null } };
url_arr = [ /* urls */ ];
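For concreteness, the two arguments might look like this (only `next` comes from the snippet above; the other field names and values are illustrative):

```javascript
// A linked list of config nodes, terminated by next: null,
// plus an array of URLs to fetch. Field names other than `next`
// are placeholders, not from the original project.
var node_list = {
  tag: 'first',
  next: { tag: 'second', next: null }  // null marks the end of the list
};

var url_arr = ['http://example.com/a.xml', 'http://example.com/b.xml'];

// Walking the linked list node by node:
for (var node = node_list; node !== null; node = node.next) {
  console.log(node.tag);
}
```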
I want to iterate over all the items in url_arr, so I do the following:
var http = require('http');

function fetch(url, node) {
    if (node == null)
        return;
    // here do something with an http request
    var req = http.get('http://www.google.com', function(res) {
        var data = '';
        res.on('data', function(chunk) {
            data += chunk;
        });
        res.on('end', function() {
            // maybe generate more new urls here
            // and get another url_list
            node = node.next;
            fetch(url_new, node);
        });
    });
}
// these calls need to run sequentially
for (var i = 0; i < url_arr.length; i++) {
    fetch(url_arr[i], node);
}
As you can see, with async HTTP requests all of them start at once, which eats up system resources, and I can't control the process. So does anyone have a good idea for solving this problem? Or is Node.js simply not the right tool for this kind of job?
If the problem is that you fire off too many HTTP requests simultaneously, you could change the fetch function to operate on a stack of URLs. Basically you would do this: when fetch is called, push the URL onto the stack and check whether a request is already in progress; if not, start one, and when it finishes, pop the next URL off the stack and repeat. This way the for-loop can still add all the URLs as it does now, but only one URL is processed at a time, so there won't be too many resources in use.