
Unable to scrape a particular web page with PhantomJS

I am trying a simple PhantomJS script, but with no luck.

var system = require('system');
var page = require('webpage').create();

// Use a URL passed on the command line if there is one, otherwise the product page
var site = system.args[1] || "https://www.mightydeals.co.uk/Products/all/National/Grey-Small/132212";

// Log JavaScript errors raised by the page itself
page.onError = function (msg, trace) {
    console.log(msg);
    trace.forEach(function (item) {
        console.log(' ', item.file, ':', item.line);
    });
};

page.open(site, function () {
    // Collect the text of every element matching #productInformation
    var p = page.evaluate(function () {
        return [].map.call(document.querySelectorAll('#productInformation'), function (link) {
            return link.innerText;
        });
    });
    console.log(p);
    // Exit inside the callback, once the page has been loaded and evaluated
    phantom.exit();
});

The page URL is in the script above; here it is again: Link to page

I am getting only errors and null output.

I need to get the product descriptions, but the script prints errors and no description.

In the console I can see that the page itself reports an error:

Uncaught SyntaxError: Unexpected token <

Is the page error causing the problem, or is it something else? Please advise.

Some sites interpret default PhantomJS requests (sent without any header settings) as coming from a mobile device. In this case, when you call page.open, the requested URL is redirected to http://m.mightydeals.co.uk/index.html#dealList/productId=132212&menu1Id=1&menu2Id=0& which doesn't have any #productInformation element.

You can check this behavior by calling page.render('page.png'), which takes a screenshot, inside the page.open callback and before page.evaluate.
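As a rough sketch of that check, the script below logs the URL PhantomJS actually ended up on and saves the screenshot. The helper isMobileRedirect is hypothetical (it is not part of PhantomJS), and its "host starts with m." test is only an assumption based on the redirect target quoted above. The PhantomJS part is guarded so the helper can also be loaded on its own.

```javascript
// Hypothetical helper: true when the final URL is on a different host
// that starts with "m." (an assumption about how this site names its mobile host).
function isMobileRedirect(requestedUrl, finalUrl) {
    var host = function (u) {
        var m = /^https?:\/\/([^\/:#?]+)/i.exec(u);
        return m ? m[1].toLowerCase() : '';
    };
    var finalHost = host(finalUrl);
    return host(requestedUrl) !== finalHost && finalHost.indexOf('m.') === 0;
}

// Only run the browser part when executed by PhantomJS
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'https://www.mightydeals.co.uk/Products/all/National/Grey-Small/132212';

    page.open(url, function (status) {
        console.log('status:', status);
        console.log('final URL:', page.url);           // URL after any redirects
        console.log('mobile redirect?', isMobileRedirect(url, page.url));
        page.render('page.png');                       // screenshot of what was loaded
        phantom.exit();
    });
}
```

If the logged final URL is on the m. host, the screenshot should show the mobile layout, which would explain the empty result.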

A quick fix is to set custom headers before calling page.open:

page.customHeaders = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0',
    'Accept': '*/*',
    'Accept-Language': 'nb-NO,nb;q=0.9,no-NO;q=0.8,no;q=0.6,nn-NO;q=0.5,nn;q=0.4,en-US;q=0.3,en;q=0.1',
    'Connection': 'keep-alive'
};

Alternatively, scrape the elements you need from the mobile version of the page.
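For the header route, a minimal end-to-end sketch might look like the following. It sets the desktop user agent through page.settings.userAgent, which is an alternative to the User-Agent entry in page.customHeaders shown above, and passes a named function to page.evaluate. The names DESKTOP_UA and collectText are illustrative, and whether the site then serves the desktop page is an assumption that has not been verified here.

```javascript
// Desktop user agent string taken from the headers above
var DESKTOP_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0';

// Illustrative helper: runs inside the page and returns the text of every match.
// It must not reference outer variables, because page.evaluate runs it in the page context.
function collectText(selector) {
    return [].map.call(document.querySelectorAll(selector), function (el) {
        return el.innerText;
    });
}

// Only run the browser part when executed by PhantomJS
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.settings.userAgent = DESKTOP_UA;   // must be set before page.open

    page.open('https://www.mightydeals.co.uk/Products/all/National/Grey-Small/132212', function (status) {
        if (status !== 'success') {
            console.log('Failed to load the page');
            phantom.exit(1);
            return;
        }
        // Extra arguments to page.evaluate are passed on to the function
        var texts = page.evaluate(collectText, '#productInformation');
        console.log(texts.join('\n'));
        phantom.exit();
    });
}
```

Checking the status argument first makes a failed load easy to tell apart from a page that loaded but has no matching element.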
