
Zombie.js in node.js fails to scrape certain websites

The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:

var Browser = require("zombie");

// Load the William Hill football page and print its HTML once the page has loaded
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
    browser.wait(function () {
        console.log(browser.html());
    });
});

Run with node.

Output:

S J ꪙRUݒ kf 6 Efr2 Riz ^ 0 X { ^ a yp p Ή ` ( S]- 'N 8q / ? ݻ u; ݇ ׯ Eiٲ> - 3 ۗG Ee , mF MI Q ۲ ڊ ZG O J ^S C~g JO 緹 Oݎ P ET n;v v D tvJn J 8' 햷r v: m J Z nh ] Z .{Z Ӳl B' .¶D ~$n / u" z Ni "Nj \\00_I\\00\\ S O E8{" m; h ,o Q y ; a[ c q D 띊? /|?: ; Z!} / wے h < % A K=-a ~'

(actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:

https://github.com/assaf/zombie/issues/251#issuecomment-5969175

He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
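In other words, the "rubbish" is most likely the raw gzip-compressed response body printed without being decompressed. Here is a minimal sketch (plain Node, no Zombie involved) to test that theory against the same URL; it advertises gzip support and then decompresses the body with zlib before printing:

var http = require("http");
var zlib = require("zlib");

http.get({
    host: "sports.williamhill.com",
    path: "/bet/en-gb/betting/y/5/et/Football.html",
    // advertise gzip so we can see what the server sends back
    headers: { "accept-encoding": "gzip" }
}, function (res) {
    console.log("content-encoding:", res.headers["content-encoding"]);

    var chunks = [];
    res.on("data", function (chunk) { chunks.push(chunk); });
    res.on("end", function () {
        var body = Buffer.concat(chunks);
        if (res.headers["content-encoding"] === "gzip") {
            // without this gunzip step you get exactly the garbage shown above
            zlib.gunzip(body, function (err, html) {
                if (err) throw err;
                console.log(html.toString("utf8"));
            });
        } else {
            console.log(body.toString("utf8"));
        }
    });
});

If the server reports content-encoding: gzip and the decompressed output is readable HTML, the problem is confirmed to be on the decoding side rather than in your scraping code.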

Thank you to everyone who looked into this.

The same code works for other sites (which also reply with gzip), so it's not a code problem.

My guess is the site is detecting that you are not running a browser and defending against data extraction.
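If that is the cause, one common workaround is to make Zombie announce itself as a regular browser. A minimal sketch, assuming the block keys on the User-Agent header (userAgent is a Zombie browser option, but whether William Hill actually checks it is a guess):

var Browser = require("zombie");

// Present a desktop Chrome user agent instead of Zombie's default;
// whether the site actually keys on this header is an assumption
var browser = new Browser({
    userAgent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
});

browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
    browser.wait(function () {
        console.log(browser.html());
    });
});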
