How to set a time interval between page scrapes with PhantomJS

I wrote a PhantomJS script that scrapes multiple pages. The script works, but I can't figure out how to set a time interval between scrapes. I tried using setInterval to pass one item from arrayList roughly every 5 seconds, but it doesn't seem to work: the script keeps breaking. Here's my example PhantomJS script:

Without setInterval:

var arrayList = ['string1', 'string2', 'string3'....]

arrayList.forEach(function(eachItem) {
    var webAddress = "http://www.example.com/eachItem"    
    phantom.create(function(ph) {
    return ph.createPage(function(page) {

        return page.open(yelpAddress, function(status) {
            console.log("opened site? ", status);


            page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {

                setTimeout(function() {
                    return page.evaluate(function() {

                        //code here for gathering data


                    }, function(result) {
                        return result
                        ph.exit();
                    });

                }, 5000);

            });
        });
    });
});

With setInterval:

var arrayList = ['string1', 'string2', 'string3'....]
var i = 0
var scrapeInterval = setInterval(function() {
    var webAddress = "http://www.example.com/arrayList[i]"    
    phantom.create(function(ph) {
    return ph.createPage(function(page) {

        return page.open(yelpAddress, function(status) {
            console.log("opened site? ", status);


              page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {

                setTimeout(function() {
                    return page.evaluate(function() {

                           //code here for gathering data


                    }, function(result) {
                           return result
                           ph.exit();
                    });

                }, 5000);

            });
        });
    });
    i++
    if(i > arrayList.length) {
    clearInterval(scrapeInterval);        
}, 5000);

Basically, I would like to send a chunk of items (10-20 of them) from arrayList, wait 1-2 minutes, and then send the next chunk, so that I don't overwhelm the website. Alternatively, is there a way to set a time interval so that the loop goes through each item in the array every 2-3 seconds?

The problem is that PhantomJS is asynchronous, but loop iteration is not. All iterations (in the first snippet) are executed before the first page has even loaded. You're essentially spawning multiple PhantomJS processes that run at the same time.
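To make the ordering concrete, here is a minimal sketch that needs no PhantomJS at all. fakeOpen is a hypothetical stand-in for page.open(); it only simulates an asynchronous call by firing its callback from a timer. The log shows that every iteration has started, and the loop has ended, before a single callback runs.

```javascript
// Hypothetical stand-in for page.open(): it does no real work and simply
// invokes its callback on a later turn of the event loop.
function fakeOpen(url, callback) {
    setTimeout(function () {
        callback('success');
    }, 10);
}

var log = [];
var items = ['string1', 'string2', 'string3'];

items.forEach(function (item) {
    log.push('start ' + item);
    fakeOpen('http://www.example.com/' + item, function (status) {
        // Runs only after the whole forEach loop has already finished.
        log.push('done ' + item);
    });
});

// At this point all three "start" entries exist and no "done" entry does.
log.push('loop finished');
```

With real pages the effect is the same, only heavier: each iteration launches its own PhantomJS process before any earlier one has finished.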

You can use something like async to run the requests sequentially:

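// Assumed setup: the phantomjs-node bridge (callback API) and the async module.
var phantom = require('phantom');
var async = require('async');

phantom.create(function(ph) {
    ph.createPage(function(page) {
        var arrayList = ['string1', 'string2', 'string3'....];

        // Build one task per item; a task calls back only once its page is done,
        // so the single page object is reused for one URL at a time.
        var tasks = arrayList.map(function(eachItem) {
            return function(callback){
                var webAddress = "http://www.example.com/" + eachItem;
                page.open(webAddress, function(status) {
                    console.log("opened site? ", status);

                    // includeJs loads a script from a URL; injectJs expects a local file path.
                    page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {

                        // Give the page 5 seconds to settle before reading from it.
                        setTimeout(function() {
                            return page.evaluate(function() {
                                //code here for gathering data
                            }, function(result) {
                                callback(null, result);
                            });
                        }, 5000);
                    });
                });
            };
        });

        // Run the tasks strictly one after another, then shut PhantomJS down.
        async.series(tasks, function(err, results){
            console.log("Finished");
            ph.exit();
        });
    });
});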

Of course, you can also move phantom.create() inside each task, which creates a separate process for each request, but the code above will be faster.
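If a pause between requests, or between chunks of requests, is wanted as well, the same sequential idea can be sketched without any library. Everything named here (chunk, eachWithDelay, the schedule parameter) is hypothetical and not part of PhantomJS or async; the worker is assumed to be a function that scrapes one item and then calls its Node-style callback.

```javascript
// Split a list into consecutive chunks of at most `size` items.
function chunk(list, size) {
    var chunks = [];
    for (var i = 0; i < list.length; i += size) {
        chunks.push(list.slice(i, i + size));
    }
    return chunks;
}

// Call worker(item, cb) for one item at a time, pausing delayMs between the
// end of one item and the start of the next. `schedule` defaults to
// setTimeout and is only a parameter so that the pause can be faked.
function eachWithDelay(items, worker, delayMs, done, schedule) {
    schedule = schedule || setTimeout;
    var results = [];

    function next(index) {
        if (index >= items.length) {
            return done(null, results); // empty list: nothing to do
        }
        worker(items[index], function (err, result) {
            if (err) {
                return done(err, results);
            }
            results.push(result);
            if (index === items.length - 1) {
                return done(null, results); // no pause after the last item
            }
            schedule(function () {
                next(index + 1);
            }, delayMs);
        });
    }

    next(0);
}
```

For the chunked variant from the question, the two helpers nest: eachWithDelay(chunk(arrayList, 10), function (group, cb) { eachWithDelay(group, scrapeOne, 2000, cb); }, 60000, onFinished), where scrapeOne and onFinished are placeholders for the page.open/page.evaluate logic and the final ph.exit() call.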

You have some typos in the second snippet, where you added the setInterval approach:

var arrayList = ['string1', 'string2', 'string3'];
var i = 0;
var scrapeInterval = setInterval(function () {
    // Concatenate the item; inside the string literal, arrayList[i] was never evaluated.
    var webAddress = "http://www.example.com/" + arrayList[i];
    phantom.create(function (ph) {
        return ph.createPage(function (page) {

            // webAddress replaces yelpAddress, which was never defined.
            return page.open(webAddress, function (status) {
                console.log("opened site? ", status);


                page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function () {

                    setTimeout(function () {
                        return page.evaluate(function () {
                            //code here for gathering data
                        }, function (result) {
                            return result
                            ph.exit();
                        });

                    }, 5000);

                });
            });
        });

        i++;
        // >= rather than >, so the interval stops right after the last item.
        if (i >= arrayList.length) {
            clearInterval(scrapeInterval);
        } //This was missing;
    }); //This was missing;
}, 5000);

Something else I noticed is the return statement in the following timeout:

setTimeout(function () {
    return page.evaluate(function () {
        //code here for gathering data
    }, function (result) {
        return result
        ph.exit();
    });
}, 5000);

Because of the return on the line above it, ph.exit(); will never be reached. I don't know whether this will cause any issues for you, but you might want to take a look at it.
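One possible way to reorder it is sketched here, on the assumption that the value returned from the result callback is not used by anything and that the data should instead be passed on explicitly. finishScrape and done are hypothetical names; ph is the object handed to the phantom.create() callback.

```javascript
// Hypothetical helper: shut PhantomJS down first, then hand the result on.
// `ph` only needs an exit() method; `done` receives the scraped data.
function finishScrape(ph, result, done) {
    ph.exit();           // reached, because nothing returns before it
    return done(result); // the return is now the last statement
}
```

Inside the result callback of page.evaluate this would be called as finishScrape(ph, result, handleResult), with handleResult standing for whatever consumes the data.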
