
How to use scrapy to crawl multiple pages?

All the examples of Scrapy I found talk about how to crawl a single page, pages with the same URL schema, or all the pages of a website. I need to crawl a series of pages A, B, C where A contains the link to B, and so on. For example, the website structure is:

A
----> B
---------> C
D
E

I need to crawl all the C pages, but to get the link to C I need to crawl A and B first. Any hints?

See the scrapy Request structure; to crawl such a chain you'll have to use the callback parameter, like the following:

class MySpider(BaseSpider):
    ...
    # spider starts here
    def parse(self, response):
        ...
        # A, D, E are done in parallel, A -> B -> C are done serially
        yield Request(url=<A url>,
                      ...
                      callback=self.parseA)
        yield Request(url=<D url>,
                      ...
                      callback=self.parseD)
        yield Request(url=<E url>,
                      ...
                      callback=self.parseE)

    def parseA(self, response):
        ...
        yield Request(url=<B url>,
                      ...
                      callback=self.parseB)

    def parseB(self, response):
        ...
        yield Request(url=<C url>,
                      ...
                      callback=self.parseC)

    def parseC(self, response):
        ...

    def parseD(self, response):
        ...

    def parseE(self, response):
        ...
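
If it helps, here is a minimal self-contained sketch of the same idea in more recent Scrapy (scrapy.Spider instead of BaseSpider), which also shows how to carry data collected on A down to C via response.meta. The URLs, XPath expressions, and field names are placeholders I made up for illustration, not part of the original answer:

import scrapy

class ChainSpider(scrapy.Spider):
    name = 'chain'
    # Hypothetical index page where the links to the A pages live.
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Follow every A page found on the index.
        for href in response.xpath('//a[@class="a-link"]/@href').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_a)

    def parse_a(self, response):
        # Grab something from A and pass it down the chain via meta.
        a_title = response.xpath('//h1/text()').get()
        for href in response.xpath('//a[@class="b-link"]/@href').getall():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_b,
                                 meta={'a_title': a_title})

    def parse_b(self, response):
        # B only exists to reach C; forward the accumulated context.
        for href in response.xpath('//a[@class="c-link"]/@href').getall():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_c,
                                 meta=response.meta)

    def parse_c(self, response):
        # C is the page we actually want; emit the final item here.
        yield {
            'a_title': response.meta.get('a_title'),
            'c_text': response.xpath('//p/text()').get(),
        }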

Here is an example spider I wrote for a project of mine:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from yoMamaSpider.items import JokeItem
from yoMamaSpider.striputils import stripcats, stripjokes
import re

class Jokes4UsSpider(CrawlSpider):
    name = 'jokes4us'
    allowed_domains = ['jokes4us.com']
    start_urls = ["http://www.jokes4us.com/yomamajokes/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            url = ''.join(link.select('./@href').extract())
            # Only follow yo-mama category pages (raw string avoids escape issues).
            relevant_urls = re.compile(
                r'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)')
            if relevant_urls.match(url):
                yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        categories = stripcats(hxs.select('//title/text()').extract())
        joke_area = hxs.select('//p/text()').extract()
        for joke in joke_area:
            joke = stripjokes(joke)
            # Skip fragments too short to be a real joke.
            if len(joke) > 15:
                yield JokeItem(joke=joke, categories=categories)

I think the parse method is what you are after: it looks at every link on the start_urls page, then uses a regex to decide whether each one is a relevant URL (i.e. a URL I would like to scrape). If it is relevant, it scrapes the page via yield Request(url, callback=self.parse_page), which hands the response to the parse_page method.
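
If you want to try it, a spider like this is run from the Scrapy project directory with the standard command-line tool; the feed-export flag and output filename here are just an example:

scrapy crawl jokes4us -o jokes.json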

Is this the kind of thing you are after?
