
Zombie.js in node.js fails to scrape certain websites

The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:

var Browser = require("zombie");

// Load the William Hill football page and print its HTML once the page has loaded
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
    browser.wait(function () {
        console.log(browser.html());
    });
});

Run with node.

Output:

S J ꪙRUݒ kf 6 Efr2 Riz ^ 0 X { ^ a yp p Ή ` ( S]- 'N 8q / ? ݻ u; ݇ ׯ Eiٲ> - 3 ۗG Ee , mF MI Q ۲ ڊ ZG O J ^S C~g JO 緹 Oݎ P ET n;v v D tvJn J 8' 햷r v: m J Z nh ] Z .{Z Ӳl B' .¶D ~$n / u" z Ni "Nj \\00_I\\00\\ S O E8{" m; h ,o Q y ; a[ c q D 띊? /|?: ; Z!} / wے h < % A K=-a ~'

(actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:

https://github.com/assaf/zombie/issues/251#issuecomment-5969175

He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
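In other words, the "rubbish" is most likely the raw gzip-compressed response body printed without being decompressed. Here is a minimal sketch (plain Node, no Zombie involved) to test that theory against the same URL; it advertises gzip support and then decompresses the body with zlib before printing:

var http = require("http");
var zlib = require("zlib");

http.get({
    host: "sports.williamhill.com",
    path: "/bet/en-gb/betting/y/5/et/Football.html",
    // advertise gzip so we can see what the server sends back
    headers: { "accept-encoding": "gzip" }
}, function (res) {
    console.log("content-encoding:", res.headers["content-encoding"]);

    var chunks = [];
    res.on("data", function (chunk) { chunks.push(chunk); });
    res.on("end", function () {
        var body = Buffer.concat(chunks);
        if (res.headers["content-encoding"] === "gzip") {
            // without this gunzip step you get exactly the garbage shown above
            zlib.gunzip(body, function (err, html) {
                if (err) throw err;
                console.log(html.toString("utf8"));
            });
        } else {
            console.log(body.toString("utf8"));
        }
    });
});

If the server reports content-encoding: gzip and the decompressed output is readable HTML, the problem is confirmed to be on the decoding side rather than in your scraping code.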

Thank you to everyone who looked into this.

The same code works for other sites (which also reply with gzip), so it's not a code problem.

My guess is the site is detecting that you are not running a browser and defending against data extraction.
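If that is the cause, one common workaround is to make Zombie announce itself as a regular browser. A minimal sketch, assuming the block keys on the User-Agent header (userAgent is a Zombie browser option, but whether William Hill actually checks it is a guess):

var Browser = require("zombie");

// Present a desktop Chrome user agent instead of Zombie's default;
// whether the site actually keys on this header is an assumption
var browser = new Browser({
    userAgent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
});

browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
    browser.wait(function () {
        console.log(browser.html());
    });
});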
