
Unable to scrape a URL with PhantomJS

I have a page that is probably protected in some way against scraping by headless browsers, but I don't know that for sure. In a regular browser it loads fine: the JavaScript executes and everything works. With PhantomJS it doesn't; it seems like either the JavaScript doesn't execute or some other issue occurs.

How can I find out what is going on? What do you recommend for scraping that page?

Here is a basic PhantomJS script that prints to the console whether a request to the given URL was successful. That should tell you whether you can reach the page at all. If it prints 'Successful', you should be able to scrape, which would suggest the problem is the page's JavaScript rather than the headless browser itself. If it prints 'Unsuccessful', you can set the userAgent setting so the request looks like it comes from a regular browser.

// Create a page object via PhantomJS's webpage module.
var page = require('webpage').create();

// Uncomment the next line to spoof a regular desktop browser's user agent.
//page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

// Open the URL and report whether the request succeeded.
page.open('http://www.google.ca', function (status) {
    if (status !== 'success') {
        console.log('Unsuccessful');
    } else {
        console.log('Successful');
    }
    phantom.exit();
});
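To run it, save the script to a file (the name check.js below is just an example) and invoke PhantomJS on it from the command line:

phantomjs check.js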

Change http://www.google.ca to the URL you want to test before running it.
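If the script prints 'Successful' but the scraped content still looks empty, the next step is to check whether the page's JavaScript actually runs inside PhantomJS. The sketch below (the URL and the 3-second wait are placeholders you would adjust for your page) hooks PhantomJS's onError, onConsoleMessage and onResourceError callbacks to surface script errors and failed resources, then reads the document title after giving the page time to render.

var page = require('webpage').create();

// Log JavaScript errors thrown inside the page.
page.onError = function (msg, trace) {
    console.log('PAGE ERROR: ' + msg);
    (trace || []).forEach(function (item) {
        console.log('  at ' + item.file + ':' + item.line);
    });
};

// Log anything the page itself prints with console.log.
page.onConsoleMessage = function (msg) {
    console.log('PAGE CONSOLE: ' + msg);
};

// Log resources (scripts, XHR, etc.) that fail to load.
page.onResourceError = function (resourceError) {
    console.log('RESOURCE ERROR: ' + resourceError.url + ' - ' + resourceError.errorString);
};

page.open('http://www.example.com', function (status) {
    console.log('Status: ' + status);
    // Give the page's scripts a moment to run, then inspect the DOM.
    setTimeout(function () {
        var title = page.evaluate(function () {
            return document.title;
        });
        console.log('Title after JS ran: ' + title);
        phantom.exit();
    }, 3000);
});

If the error callbacks stay silent and the title (or other content you check with page.evaluate) looks right, the page's JavaScript is executing and the problem is more likely elsewhere, such as content loaded after a longer delay or server-side blocking based on the user agent.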
