Recursive crawling not working with Scrapy Spider

I've been trying to crawl recipe titles from Food Network, and I want to move recursively to the next page. I'm using Python 3, so some Scrapy functions are not available to me, but here's what I have so far:

import scrapy
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from testspider.items import testspiderItem
from lxml import html

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["foodnetwork.com"]
    start_urls = ["http://www.foodnetwork.com/recipes/aarti-sequeira/middle-eastern-fire-roasted-eggplant-dip-babaganoush-recipe.html"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="recipe-next"]/a/@href',)), callback="parse_page", follow=True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//h1[@itemprop="name"]/text()')

        for title in titles:
            item = testspiderItem()
            item["title"] = title
            yield item

The tags from the webpage source are:

<div class="recipe-next">
    <a href="/recipes/food-network-kitchens/middle-eastern-eggplant-rounds-recipe.html">Next Recipe</a>
</div>

Any help would be appreciated!

CrawlSpider uses the parse method internally, so when you override it the rules stop working as expected (see the docs). To quote them:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Also, your code snippet doesn't show the source for your parse_page() method, which the rule names as its callback.
