
Unable to scrape a URL with PhantomJS

I have a page that is probably protected in some way against scraping by headless browsers, but I don't know that for sure. In a regular browser it loads fine: the JavaScript executes and everything works. With PhantomJS it doesn't; it seems like either the JavaScript doesn't execute or some other issue occurs.

How can I find out what is going on? What do you recommend for scraping that page?

Here is a basic PhantomJS script that prints to the console whether a request to the given URL was successful. That should tell you whether you can reach the page at all. If it prints 'Successful', you should be able to scrape, which would suggest the problem is the page's JavaScript rather than the headless browser itself. If it prints 'Unsuccessful', you can set the userAgent setting so the request looks like it comes from a regular browser.

// Create a page object via PhantomJS's webpage module.
var page = require('webpage').create();

// Uncomment the next line to spoof a regular desktop browser's user agent.
//page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

// Open the URL and report whether the request succeeded.
page.open('http://www.google.ca', function (status) {
    if (status !== 'success') {
        console.log('Unsuccessful');
    } else {
        console.log('Successful');
    }
    phantom.exit();
});
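To run it, save the script to a file (the name check.js below is just an example) and invoke PhantomJS on it from the command line:

phantomjs check.js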

Change http://www.google.ca to the URL you want to test before running it.
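If the script prints 'Successful' but the scraped content still looks empty, the next step is to check whether the page's JavaScript actually runs inside PhantomJS. The sketch below (the URL and the 3-second wait are placeholders you would adjust for your page) hooks PhantomJS's onError, onConsoleMessage and onResourceError callbacks to surface script errors and failed resources, then reads the document title after giving the page time to render.

var page = require('webpage').create();

// Log JavaScript errors thrown inside the page.
page.onError = function (msg, trace) {
    console.log('PAGE ERROR: ' + msg);
    (trace || []).forEach(function (item) {
        console.log('  at ' + item.file + ':' + item.line);
    });
};

// Log anything the page itself prints with console.log.
page.onConsoleMessage = function (msg) {
    console.log('PAGE CONSOLE: ' + msg);
};

// Log resources (scripts, XHR, etc.) that fail to load.
page.onResourceError = function (resourceError) {
    console.log('RESOURCE ERROR: ' + resourceError.url + ' - ' + resourceError.errorString);
};

page.open('http://www.example.com', function (status) {
    console.log('Status: ' + status);
    // Give the page's scripts a moment to run, then inspect the DOM.
    setTimeout(function () {
        var title = page.evaluate(function () {
            return document.title;
        });
        console.log('Title after JS ran: ' + title);
        phantom.exit();
    }, 3000);
});

If the error callbacks stay silent and the title (or other content you check with page.evaluate) looks right, the page's JavaScript is executing and the problem is more likely elsewhere, such as content loaded after a longer delay or server-side blocking based on the user agent.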
