
Re-run casperjs script

I'm relatively new to CasperJS and have written simple scraping scripts. Now I have a harder task: I want to scrape data from a list of URLs, but some pages sometimes "fail". I use a captcha-solving service because a few of these pages show a captcha by default, but PhantomJS renders some captchas inconsistently: sometimes they load, sometimes they don't.

The solution I thought of was to rerun the script with the pages that failed to load the captcha, in order to get the amount of data I need. But I can't get it running. I thought of wrapping the whole thing in a function, invoking it inside the casper.run() method, and checking whether the amount of data scraped meets the minimum I need; if not, rerun. But I don't really know how to accomplish it, since from what I've seen casperjs adds the steps to the stack before calling the function (correct me if I'm wrong). I'm also thinking of using the run.complete event, but I'm not sure how to do it. My script is something like this:

// This variable stores the amount of data collected
var pCount = 0;
var urls = ["http://page1.com", "http://page2.com"];
// Create casperjs instance...
casper.start();

casper.eachThen(urls, function(response) {
    if (pCount < casper.cli.options.number) {
        casper.thenOpen(response.data, function(response) {
            // Here is where the magic goes on
        });
    }
});
casper.run();

Is there any way I can wrap the casper.eachThen() block in a function and do something like this?

casper.start();
function sample () {
    casper.eachThen(urls, function(response) {
        if (pCount < casper.cli.options.number) {
            casper.thenOpen(response.data, function(response) {
            // Here is where the magic goes on
            })
        }
    })
}
casper.run(sample);

Also, I tried using slimerjs as the engine to avoid the "inconsistencies", but I couldn't get the __utils__.sendAjax() method working inside a casper.evaluate() call I have, so it's a deal-breaker. Or is there a way to do a GET request asynchronously in a separate instance? If so, I would appreciate your advice.

Update: I never managed to solve it with casperjs. I nonetheless found a workaround for my particular use case; check my answer for more info.

Maybe with the back function, so something like this:

var count = 0; // attempts so far; declared outside the step so it is not reset each time

casper.start()
.thenOpen('your url')
.then(function(){
    if (this.exists("selector containing the captcha")){
        //continue the script
    }
    else if (count==3){
        this.echo("in 3 attempts, it failed each time");
        this.exit();
    }
    else{
        count++;
        casper.back();//moves back one page in the browser history; the intent is to re-open the url
    }
})
.run();
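Another way to retry inside a single run, sketched below, is to schedule the retry as a new step. As far as I know, CasperJS queues a step that is added from inside a running step right after the current one, so a function that calls thenOpen() on itself can re-open the same url until the captcha shows up. The names openWithRetry, captchaSelector, maxAttempts and onResult are made up for this sketch, and it has only been checked against a stand-in object, not a real captcha page:

```javascript
// Sketch only: `casper` is any object exposing CasperJS-style thenOpen() and exists().
// openWithRetry, captchaSelector, maxAttempts and onResult are illustrative names.
function openWithRetry(casper, url, captchaSelector, maxAttempts, onResult, attempt) {
    attempt = attempt || 1;
    casper.thenOpen(url, function () {
        if (this.exists(captchaSelector)) {
            // Captcha rendered: hand over to the scraping code.
            onResult.call(this, url, true, attempt);
        } else if (attempt >= maxAttempts) {
            // Out of attempts: report the url as failed.
            onResult.call(this, url, false, attempt);
        } else {
            // Queue one more attempt for the same url.
            openWithRetry(casper, url, captchaSelector, maxAttempts, onResult, attempt + 1);
        }
    });
}
```

In the question's script this would be called from the casper.eachThen() callback, for example openWithRetry(this, response.data, '#captcha', 3, function (url, ok) { if (ok) { pCount++; } }), where '#captcha' is a placeholder selector.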

I never found a way to do this from casper; this is how I solved it:

There's a program A that manages user input (in my case written in C#). Program A is the one that executes the casperjs script and reads its output. If I need to rerun the script, I just output a message with some specifications so that I can catch it in program A.

It may not be the best way, but it worked for me. Hope it helps.
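For illustration, here is a rough sketch of the supervising side in JavaScript (Node) rather than C#. It is not the original program A. The "RERUN:" marker, the function names, and the convention that the script prints one marker line per url it wants retried are all assumptions made up for this sketch:

```javascript
// Sketch only: the marker text and the function names are assumptions.
var RERUN_MARKER = 'RERUN:';

// Collect the urls that the scraping script asked to have retried.
function parseRerunUrls(output) {
    return output.split(/\r?\n/).filter(function (line) {
        return line.indexOf(RERUN_MARKER) === 0;
    }).map(function (line) {
        return line.slice(RERUN_MARKER.length).trim();
    });
}

// Run the script, then run it again on whatever it reported as failed.
// runScript(urls) must return the script's standard output as a string.
function superviseRuns(runScript, urls, maxRuns) {
    var pending = urls.slice();
    var runs = 0;
    while (pending.length > 0 && runs < maxRuns) {
        pending = parseRerunUrls(runScript(pending));
        runs++;
    }
    return { runs: runs, stillFailing: pending };
}
```

Here runScript is passed in so the loop can be tried without launching anything. In practice it could wrap require('child_process').spawnSync('casperjs', [...], { encoding: 'utf8' }).stdout, with the script name and its arguments depending on your setup, while the casperjs script prints a line such as this.echo('RERUN: ' + url) for each page whose captcha did not load.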
