
Recursive use of Scrapy to scrape webpages from a website

I have recently started to work with Scrapy. I am trying to gather some info from a large list which is divided into several pages (about 50). I can easily extract what I want from the first page, including the first page in the start_urls list. However, I don't want to add the links to all 50 of these pages to this list. I need a more dynamic way. Does anyone know how I can iteratively scrape web pages? Does anyone have any examples of this?

Thanks!
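For reference, a minimal sketch of the kind of spider this would take, assuming a recent Scrapy release and that each list page carries a "next page" link; the domain and CSS selectors below are placeholders, not taken from the actual site:

```python
import scrapy


class ListSpider(scrapy.Spider):
    # Hypothetical spider: the name, start URL and selectors are placeholders.
    name = "list_spider"
    start_urls = ["http://www.example.com/list?page=1"]

    def parse(self, response):
        # Extract the data you need from the current page.
        for row in response.css("div.item"):
            yield {"title": row.css("a::text").get()}

        # Follow the "next page" link if there is one; Scrapy calls parse()
        # again on that page, so the crawl continues until no link remains.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

The key point is that parse() yields a new request for the following page instead of listing all 50 URLs up front in start_urls.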

Use urllib2 to download a page. Then use either re (regular expressions) or BeautifulSoup (an HTML parser) to find the link to the next page you need. Download that with urllib2. Rinse and repeat.
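A rough sketch of that loop (Python 2, since urllib2 is named; BeautifulSoup 4 and the pagination markup are assumptions):

```python
import urllib2

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

url = "http://www.example.com/list?page=1"  # placeholder start page
while url:
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # ... pull whatever data you need out of `soup` here ...

    # Find the link to the next page; the "next" class is a guess --
    # inspect the real site to see how its pagination link is marked up.
    next_link = soup.find("a", {"class": "next"})
    url = next_link["href"] if next_link else None
```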

Scrapy is great, but you don't need it to do what you're trying to do.

Why don't you want to add all the links to the 50 pages? Are the URLs of the pages consecutive, like www.site.com/page=1, www.site.com/page=2, or are they all distinct? Can you show me the code that you have now?
