Scraping with Node.js
I have a strange problem.
Because of the strange way this site presents time data, I wanted to write a small parser.
I was testing my code on one specific URL:
http://www.sfweekly.com/search/results/?keyword=*&type=events#type:events/page:57/
Note that when you visit the URL, the page first loads a batch of entries and then replaces them. It seems to go to the first page and then redirect. How do I get around that?
To scrape, I'm using:
jsdom.env({
  html: url,
  scripts: ['http://code.jquery.com/jquery.js'],
  done: function (errors, window) {
    // doSomething
  }
});
I originally thought I could get around this with a pause, but that didn't work. Is there some way I can 'listen' for a redirect and wait until the real page has loaded? I also have a feeling that the new entries may be inserted with a jQuery replace, but I'm not sure how to test that theory.
Scraping ajax-y sites like this can be a real pain. In this case there seems to be a way around it: snoop around in your browser's developer tools, discover the ajax endpoint, and use that directly:
http://www.sfweekly.com/search/ajaxsearch/type%3aevents/page:57/
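As a rough sketch of what using the endpoint directly might look like: the part of a URL after `#` is never sent to the server, which is probably why the original request comes back with the first page. The snippet below reads the page number out of that fragment, builds the matching endpoint URL, and fetches it with Node's built-in `http` module. The helper names (`pageFromFragment`, `buildAjaxUrl`, `fetchBody`) are made up for illustration, and the `page:N` URL pattern is assumed from the single example above, so treat this as a starting point rather than a tested scraper.

```javascript
const http = require('http');

// Endpoint prefix taken from the URL found in the developer tools (see above).
// Assumption: other pages follow the same "page:N/" pattern.
const AJAX_BASE = 'http://www.sfweekly.com/search/ajaxsearch/type%3aevents/';

// Hypothetical helper: pull the page number out of the "#..." fragment.
// Falls back to page 1 when the fragment has no "page:N" part.
function pageFromFragment(pageUrl) {
  const match = /page:(\d+)/.exec(new URL(pageUrl).hash);
  return match ? Number(match[1]) : 1;
}

// Hypothetical helper: build the endpoint URL for a given result page.
function buildAjaxUrl(page) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer');
  }
  return AJAX_BASE + 'page:' + page + '/';
}

// Hypothetical helper: GET a URL and resolve with the response body as a string.
function fetchBody(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Example usage (not run here because it needs network access):
// const original = 'http://www.sfweekly.com/search/results/?keyword=*&type=events#type:events/page:57/';
// fetchBody(buildAjaxUrl(pageFromFragment(original)))
//   .then((body) => console.log(body.length))
//   .catch(console.error);
```

Whatever the endpoint returns can then be handed to the parser of your choice. Check the response in the developer tools first to see what format it comes in.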
In some scenarios, such as JavaScript-heavy sites that intentionally try to foil scrapers, you have to use some kind of headless or automated browser. That's slow and annoying, so avoid it if you can.