
Scraping with Node.js

I have a strange problem.

Because of the strange way that this site presents time data, I wanted to write a small parser.

I was testing my code on one specific URL:

http://www.sfweekly.com/search/results/?keyword=*&type=events#type:events/page:57/

Note that when you visit the URL, the page first loads a bunch of entries and then changes those entries. What's happening is that it goes to the first page and then redirects. How do I get around that?

To scrape, I'm using:

var jsdom = require('jsdom');

jsdom.env({
    html: url, // url holds the address shown above
    scripts: ['http://code.jquery.com/jquery.js'],
    done: function (errors, window) {
        // doSomething
    }
});

I originally thought I could get around this with a pause, but that's not the case. Is there some way I can 'listen' for a redirect and wait until the real page has been loaded? I also have a feeling that the new entries may be inserted with a jQuery replace, but I'm not sure how to test that theory.

Scraping ajax-y sites like this can be a real pain. In this case, it seems there is a way around it: you can snoop around in the developer tools of your browser of choice, discover the ajax endpoint, and use that directly:

http://www.sfweekly.com/search/ajaxsearch/type%3aevents/page:57/
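
As a rough sketch of what "use that directly" could look like, the snippet below builds the endpoint URL for a given page and fetches its body with Node's built-in http module. The helper names buildAjaxUrl and fetchBody are made up for illustration, and it is only an assumption that the page number is the sole part of the path that changes; the rest of the path is copied from the endpoint above.

```javascript
const http = require('http');

// Hypothetical helper. Assumption: only the page number varies; the rest of
// the path is copied verbatim from the endpoint seen in the developer tools.
function buildAjaxUrl(page) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer');
  }
  return 'http://www.sfweekly.com/search/ajaxsearch/type%3aevents/page:' + page + '/';
}

// Hypothetical helper. Fetches a URL and resolves with the response body
// as a string; rejects on network errors or any non-200 status.
function fetchBody(url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (res) {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8');
      let body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () { resolve(body); });
    }).on('error', reject);
  });
}
```

Something like fetchBody(buildAjaxUrl(57)) would then yield a promise for the raw response, which could be handed to the parser of your choice. Whether the endpoint returns an HTML fragment or some other format is not stated here, so check the response in the developer tools first.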

In some scenarios, such as JavaScript-heavy sites that intentionally try to foil scrapers, you have to use some kind of headless or automated browser. That's slow and annoying; avoid it if you can.

