I've been trying to crawl recipe titles from Food Network, and I want to recursively follow the "next" link to each subsequent page. I'm on Python 3, so some parts of Scrapy aren't available to me. Here's what I have so far:
import scrapy
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from testspider.items import testspiderItem
from lxml import html

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["foodnetwork.com"]
    start_urls = ["http://www.foodnetwork.com/recipes/aarti-sequeira/middle-eastern-fire-roasted-eggplant-dip-babaganoush-recipe.html"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="recipe-next"]/a/@href',)), callback="parse_page", follow=True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//h1[@itemprop="name"]/text()')
        for title in titles:
            item = testspiderItem()
            item["title"] = title
            yield item
The tags from the webpage source are:
<div class="recipe-next">
    <a href="/recipes/food-network-kitchens/middle-eastern-eggplant-rounds-recipe.html">Next Recipe</a>
</div>
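For what it's worth, when I test that snippet directly with lxml (the same library my spider uses), both the link element and the bare href come back, so the XPath expressions themselves seem to match:

```python
from lxml import html

# The exact snippet from the page source above.
snippet = """
<div class="recipe-next">
    <a href="/recipes/food-network-kitchens/middle-eastern-eggplant-rounds-recipe.html">Next Recipe</a>
</div>
"""

doc = html.fromstring(snippet)

# Selecting the <a> element itself:
links = doc.xpath('//div[@class="recipe-next"]/a')

# Selecting the raw href attribute, as my rule's restrict_xpaths does:
hrefs = doc.xpath('//div[@class="recipe-next"]/a/@href')

print(links[0].text)  # Next Recipe
print(hrefs[0])       # /recipes/food-network-kitchens/middle-eastern-eggplant-rounds-recipe.html
```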
Any help would be appreciated!
CrawlSpider uses the parse method itself; when you override it, things stop working as expected (see the docs). To quote the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
Also, your code snippet doesn't show the source for your parse_page() method, even though your rule names it as the callback.