Zombie.js in node.js fails to scrape certain websites

The simple script below returns a bunch of garbage. It works for most websites, but not for William Hill:

var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football betting page
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
    // Wait for pending events to finish, then dump the resulting HTML
    browser.wait(function () {
        console.log(browser.html());
    });
});

Run with node.

Output:

S J ꪙRUݒ kf 6 Efr2 Riz ^ 0 X { ^ a yp p Ή ` ( S]- 'N 8q / ? ݻ u; ݇ ׯ Eiٲ> - 3 ۗG Ee , mF MI Q ۲ ڊ ZG O J ^S C~g JO 緹 Oݎ P ET n;v v D tvJn J 8' 햷r v: m J Z nh ] Z .{Z Ӳl B' .¶D ~$n / u" z Ni "Nj \\00_I\\00\\ S O E8{" m; h ,o Q y ; a[ c q D 띊? /|?: ; Z!} / wے h < % A K=-a ~'

(The actual output is much longer.)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:

https://github.com/assaf/zombie/issues/251#issuecomment-5969175

He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
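That reply suggests the garbage above was a gzip-compressed response body printed without being decompressed. The sketch below is not Zombie's code; it only illustrates the idea with Node's built-in `zlib` module. The helper name `decodeBody` and the sample HTML string are hypothetical, and the compressed buffer is produced locally instead of being fetched from the site.

```javascript
const zlib = require("zlib");

// Hypothetical helper (not part of Zombie): decode a raw response body
// according to the value of its Content-Encoding header.
function decodeBody(rawBody, contentEncoding) {
  const encoding = (contentEncoding || "identity").trim().toLowerCase();
  if (encoding === "gzip" || encoding === "x-gzip") {
    // The body is compressed, so it must be gunzipped before reading it as text
    return zlib.gunzipSync(rawBody).toString("utf8");
  }
  if (encoding === "identity") {
    // No compression was applied; the bytes are the text itself
    return rawBody.toString("utf8");
  }
  throw new Error("Unsupported Content-Encoding: " + encoding);
}

// Simulate a server that answers with a gzip-compressed page
const html = "<html><body>Football odds</body></html>";
const compressed = zlib.gzipSync(Buffer.from(html, "utf8"));

// Printing the compressed bytes as text gives unreadable output
console.log(compressed.toString("utf8"));

// Decoding according to the header recovers the original markup
const decoded = decodeBody(compressed, "gzip");
console.log(decoded === html);
```

A client that cannot decompress responses can instead ask for an uncompressed body by sending `Accept-Encoding: identity`, which appears to be the approach the quoted fix describes.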

Thank you to all who looked into this.

The same code works for other sites (which also reply using gzip), so it's not a problem with your code.

My guess is that the site detects that you are not running a real browser and is defending against data extraction.
