Scraping with Node.js

Question

I have a strange problem.

Because of the strange way this site presents time data, I wanted to write a small parser.

I was testing my code on one specific URL:

http://www.sfweekly.com/search/results/?keyword=*&type=events#type:events/page:57/

Note that when you visit the URL, the page first loads a bunch of entries and then changes those entries. What's happening is that it goes to the first page and then redirects. How do I get around that?

To scrape, I'm using:

var jsdom = require('jsdom');

jsdom.env({
    html: url,                                      // page to load
    scripts: ['http://code.jquery.com/jquery.js'],  // inject jQuery into the page
    done: function (errors, window) {
        // doSomething
    }
});

I originally thought I could get around this with a pause, but that's not the case. Is there some way I can 'listen' for a redirect and wait until the real page has been loaded? I also have a feeling that the new entries may be inserted with a jQuery replace, but I'm not sure how to test that theory.

Answer 1

Scraping ajax-y sites like this can be a real pain. In this case, it seems there is a way around it: snoop around in the developer tools of your browser of choice, find the ajax endpoint, and use that directly:

http://www.sfweekly.com/search/ajaxsearch/type%3aevents/page:57/
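
To make the idea concrete, here is a minimal sketch of going from the hash-style URL in the question to the endpoint above and fetching it with Node's built-in http module. The helper names (toAjaxUrl, fetchBody) are made up for illustration, and the mapping from fragment to path is an assumption based only on the two URLs in this thread. The part after the # is never sent to the server, which is why loading the original URL returns the first page.

```javascript
const http = require('http');

// Hypothetical helper: move the paging state from the URL fragment into the
// path of the ajax endpoint. Assumes the fragment looks like
// "type:events/page:57/", as in the question.
function toAjaxUrl(resultsUrl) {
  const parsed = new URL(resultsUrl);
  const fragment = parsed.hash.replace(/^#/, '');
  // The endpoint shown above percent-encodes the first colon ("type%3aevents").
  const path = fragment.replace(/^type:/, 'type%3a');
  return parsed.origin + '/search/ajaxsearch/' + path;
}

// Hypothetical helper: GET a URL and hand the response body to a
// Node-style callback(err, body).
function fetchBody(url, callback) {
  http.get(url, function (res) {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(null, body); });
  }).on('error', callback);
}

// Usage (performs a network request, so it is left commented out):
// const pageUrl = 'http://www.sfweekly.com/search/results/?keyword=*&type=events#type:events/page:57/';
// fetchBody(toAjaxUrl(pageUrl), function (err, body) {
//   if (err) throw err;
//   console.log(body.length);
// });
```

If the response turns out to be an HTML fragment, it could presumably be handed to the same jsdom setup from the question in place of the page URL. I have not checked what the endpoint actually returns.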

In some scenarios, such as javascript-y sites that intentionally try to foil scrapers, you have to use some kind of headless or automated browser. That's slow and annoying, so avoid it if you can.
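
For completeness, a headless-browser fallback might look roughly like the sketch below. It assumes the puppeteer package is installed (it is not mentioned in the question), and both the function name renderPage and the selector argument are hypothetical. The browser runs the page's own scripts, so the HTML it returns is what you would see after the entries have been swapped in.

```javascript
// Sketch only: assumes `npm install puppeteer`.
async function renderPage(url, readySelector) {
  // Required lazily so the module still loads when puppeteer is not installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network has been quiet, so ajax-loaded content is in place.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Optionally wait for an element that only exists once the real entries are in.
    if (readySelector) {
      await page.waitForSelector(readySelector);
    }
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Usage (starts a real browser, so it is left commented out):
// renderPage(pageUrl, '.some-entry-class').then(function (html) {
//   console.log(html.length);
// });
```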

Answer 1: score 0, accepted, 2013-03-23 06:18:59
