
Express app.get() equivalent in CasperJS

I have built a simple web scraper that scrapes a website and outputs the data I need when I visit the URL localhost:3434/page. I implemented this using the Express app.get() method.

I have the following questions:

1) I want to know if there is a way to implement this functionality in CasperJS.

2) Is there a way to make this code start scraping only after I visit the URL localhost:8081/scrape? I don't think I am creating the endpoint correctly, because it starts scraping before I even visit the URL.

3) When I visit the URL it gives me an error saying that the URL is not available.

I think all of these problems will be solved if I can set the endpoint correctly to localhost:3434/page in CasperJS. I don't need the results to appear on the page; I only need the scrape to start when I visit that URL.

Below is the code I developed to scrape a website and create a server in Casper.

var server = require('webserver').create();

var service = server.listen(3434, function(request, response) {
    var casper = require('casper').create({
        logLevel: "verbose",
        debug: true
    });

    var links;
    var name;
    var paragraph;
    var firstName;
    var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;
    var regex = new RegExp(expression);

    casper.start('http://www.home.com/professionals/c/oho,-TN');

    casper.then(function getLinks(){
         links = this.evaluate(function(){
            var links = document.getElementsByClassName('pro-title');
            links = Array.prototype.map.call(links,function(link){
                return link.getAttribute('href');
            });
            return links;
        });
    });

    casper.then(function(){
        this.each(links,function(self,link){
          if (link.match(regex)) {
            self.thenOpen(link,function(a){
              var firstName = this.fetchText('div.info-list-text');
              this.echo(firstName);
            });
          }
        });
    });

    casper.run(function() {
        response.statusCode = 200;
        response.write(firstName);
        response.close();
    });
});

The web server you used in your CasperJS script is PhantomJS's Web Server Module, which is "intended for ease of communication between PhantomJS scripts and the outside world and is not recommended for use as a general production server".

You should not build your web server in PhantomJS. Check out these node-phantom bridges that will let you use Phantom from your regular NodeJS web server:

SpookyJS is a driver particularly for CasperJS, whereas others are for PhantomJS only.
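To illustrate the idea of keeping the web server in Node and starting the scrape only when the route is requested, here is a minimal sketch. It does not use any of the bridges; instead it assumes the simpler approach of launching CasperJS as a child process. The names `startScrape`, `createHandler` and the script file `scrape.js` are hypothetical, and it uses Node's built-in `http` module rather than Express so that it has no dependencies.

```javascript
var http = require('http');
var spawn = require('child_process').spawn;

// Hypothetical helper: runs `casperjs scrape.js` as a child process.
// Assumes the casperjs executable is on the PATH and scrape.js exists.
function startScrape() {
    var child = spawn('casperjs', ['scrape.js'], { stdio: 'inherit' });
    child.on('error', function (err) {
        console.error('could not start casperjs:', err.message);
    });
    return child;
}

// Builds a request handler. The scrape function is passed in so that
// nothing is started until a matching request actually arrives.
function createHandler(scrape) {
    return function (request, response) {
        if (request.method === 'GET' && request.url === '/page') {
            scrape();                      // scraping starts here, per request
            response.statusCode = 202;     // accepted, result not returned
            response.end('scrape started\n');
        } else {
            response.statusCode = 404;
            response.end('not found\n');
        }
    };
}
```

To serve it, pass the handler to `http.createServer(createHandler(startScrape)).listen(3434)`; visiting localhost:3434/page should then trigger one scrape per request, and any other path gets a 404 without starting anything.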

That said, CasperJS can be loaded from within PhantomJS, so you can at least use it with Phridge (I'm not sure about the others), since Phridge has a .run function that runs any function directly inside the PhantomJS environment:

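var path = require('path');

// Resolve the CasperJS install directory from its bootstrap script.
var casperPath = path.join(require.resolve('casperjs/bin/bootstrap'), '/../..');
phantom.run(casperPath, function(casperPath) {
    // This function body executes inside the PhantomJS environment.
    phantom.casperPath = casperPath;
    phantom.injectJs(casperPath + '/bin/bootstrap.js');
    casper = require('casper').create();
    ...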

Besides the ones that use PhantomJS, there are others as well:

ZombieJS uses native NodeJS libraries, which makes it the fastest and most natural to use in a NodeJS app, although it's meant more for testing purposes and may not work on every site that other scrapers can handle.
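One detail in the question's script is independent of which server or scraper you pick: `var firstName` inside the `thenOpen` callback declares a new local variable, so the outer `firstName` that is later passed to `response.write` is never assigned. A minimal sketch of one way around this in plain JavaScript follows; `collectNames` and the `pages` array are hypothetical stand-ins for the text returned by `fetchText` on each page.

```javascript
// Gathers each page's text into an array that lives in the outer scope,
// so the code that runs after the loop can still read the results.
function collectNames(pages) {
    var names = [];                        // outer variable, declared once
    pages.forEach(function (page) {
        var firstName = page.text.trim();  // local to this callback only
        names.push(firstName);             // stored in the outer array
    });
    return names.join('\n');
}

// Hypothetical sample data in place of real scraped pages.
console.log(collectNames([{ text: ' Alice ' }, { text: 'Bob' }]));
```

In the CasperJS script the same pattern would mean pushing each fetched value into an outer array and writing the joined array in the final callback, rather than redeclaring `firstName` inside the loop.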

