Re-run casperjs script
I'm relatively new to CasperJS and have written simple scraping scripts; now I'm facing a more difficult task. I want to scrape data from a list of URLs, but some pages sometimes "fail". I use a captcha-solving service because a few of these pages show a captcha by default, but PhantomJS renders some captchas inconsistently: sometimes they load, sometimes they don't.
The solution I thought of was to rerun the script with the pages that failed to load the captcha, in order to get the amount of data I need. But I can't seem to get it running. I thought of putting the whole thing in a function, invoking it inside the casper.run() method, checking whether the amount of data scraped meets the minimum I need, and rerunning if it doesn't. I don't really know how to accomplish that, because from what I've seen CasperJS adds the steps to the stack before calling the function (correct me if I'm wrong). I'm also thinking of using the run.complete event, but I'm not sure how to do it. My script is something like this:
// This variable stores the amount of data collected
var pCount = 0;
var urls = ["http://page1.com", "http://page2.com"];
// Create casperjs instance...
casper.start();
casper.eachThen(urls, function(response) {
    if (pCount < casper.cli.options.number) {
        casper.thenOpen(response.data, function(response) {
            // Here is where the magic goes on
        });
    }
});
casper.run();
Is there any way I can wrap the casper.eachThen() block in a function and do something like this?
casper.start();
function sample() {
    casper.eachThen(urls, function(response) {
        if (pCount < casper.cli.options.number) {
            casper.thenOpen(response.data, function(response) {
                // Here is where the magic goes on
            });
        }
    });
}
casper.run(sample);
Also, I tried using SlimerJS as the engine to avoid the "inconsistencies", but I couldn't get the __utils__.sendAjax() method working inside the casper.evaluate() call I have, so that's a deal-breaker. Or is there a way to do a GET request asynchronously in a separate instance? If so, I would appreciate your advice.
Update: I never managed to solve it with CasperJS, but I found a workaround for my particular use case; see my answer for more info.
Maybe with the back function, something like this:
// Declared outside the step so the counter is not reset on every attempt
var count = 0;
casper.start()
    .thenOpen('your url')
    .then(function() {
        if (this.exists("selector containing the captcha")) {
            // continue the script
        }
        else if (count == 3) {
            this.echo("in 3 attempts, it failed each time");
            this.exit();
        }
        else {
            count++;
            casper.back(); // goes back in the browser history, intended to re-open the url
        }
    })
    .run();
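As a complement to the snippet above, here is a minimal sketch of another way to retry, assuming the callback passed to casper.run() may call casper.start() and casper.run() again to queue a fresh pass (the pattern used by the dynamic.js sample shipped with CasperJS). The names scrapeWithRetries, captchaSelector, maxPasses and done are hypothetical and not part of CasperJS; the casper instance is passed in as a parameter so the retry logic stays separate from the scraping itself.

```javascript
// Sketch only: scrapeWithRetries, captchaSelector, maxPasses and done are
// hypothetical names, not CasperJS API. `casper` is an instance created with
// require('casper').create().
function scrapeWithRetries(casper, urls, captchaSelector, maxPasses, done) {
    var collected = [];          // URLs whose captcha rendered
    var pending = urls.slice();  // URLs still to be tried
    var pass = 0;

    function queuePass() {
        var failed = [];
        pass++;
        casper.start();
        casper.eachThen(pending, function(response) {
            var url = response.data;
            casper.thenOpen(url, function() {
                if (casper.exists(captchaSelector)) {
                    collected.push(url);   // the real scraping would go here
                } else {
                    failed.push(url);      // captcha did not render, retry later
                }
            });
        });
        // The run callback fires once every queued step has finished.
        casper.run(function() {
            pending = failed;
            if (pending.length > 0 && pass < maxPasses) {
                queuePass();               // queue another pass with the failures only
            } else {
                done(collected, pending);
            }
        });
    }

    queuePass();
}

// Possible usage inside a CasperJS script (the selector is an assumption):
// var casper = require('casper').create();
// scrapeWithRetries(casper, urls, '#captcha', 3, function(ok, stillFailing) {
//     casper.echo(ok.length + ' scraped, ' + stillFailing.length + ' failed');
//     casper.exit();
// });
```

Each pass only re-opens the URLs that failed in the previous one, and the loop stops after maxPasses attempts so a page that never renders its captcha cannot keep the script running forever.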
I never found a way to do this from within CasperJS; this is how I solved it:
There's a program A that manages user input (in my case written in C#). Program A is the one that executes the CasperJS script and reads its output. If I need to rerun the script, I just output a message with some specifications so that I can catch it in program A.
It may not be the best way, but it worked for me. Hope it helps.
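The supervising-program idea can be sketched roughly as follows. The original program A was written in C# and is not shown, so this is a Node.js stand-in, not the author's code. The marker string RERUN_REQUESTED, the function names and the script path are all assumptions made for illustration; it also assumes the casperjs executable is on the PATH.

```javascript
// Sketch only: a Node.js stand-in for "program A". RERUN_MARKER,
// runCasperOnce and supervise are hypothetical names.
var childProcess = require('child_process');

var RERUN_MARKER = 'RERUN_REQUESTED';

// Runs the CasperJS script once and returns everything it printed to stdout.
function runCasperOnce(scriptPath) {
    var result = childProcess.spawnSync('casperjs', [scriptPath], { encoding: 'utf8' });
    if (result.error) {
        throw result.error;   // e.g. casperjs is not installed or not on the PATH
    }
    return result.stdout;
}

// Calls runOnce() until its output no longer contains the marker,
// or until maxRuns attempts have been made.
function supervise(runOnce, maxRuns) {
    var outputs = [];
    for (var run = 1; run <= maxRuns; run++) {
        var output = runOnce();
        outputs.push(output);
        if (output.indexOf(RERUN_MARKER) === -1) {
            return { finished: true, runs: run, outputs: outputs };
        }
    }
    return { finished: false, runs: maxRuns, outputs: outputs };
}

// Possible usage (the script path is an assumption):
// var report = supervise(function() { return runCasperOnce('scraper.js'); }, 5);
```

On the CasperJS side, the script would print the marker (for example with this.echo('RERUN_REQUESTED')) before exiting whenever it did not collect enough data. Passing the runner in as a function keeps the rerun loop independent of how the script is actually launched.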