
Scrapy Crawl Spider not following links

So I wrote a web crawler to extract food items from walmart.com. Here is my spider. I can't seem to figure out why it does not follow the links on the left. It pulls the main page and then finishes.

My intended goal is for it to follow all the links on the left flyout bar and then extract each food item from those pages.

I even tried just using allow=() so that it follows every link on the page, but that still does not work.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from walmart_scraper.items import GroceryItem


class WalmartFoodSpider(CrawlSpider):
    name = "walmart_scraper"
    allowed_domains = ["www.walmart.com"]
    start_urls = ["http://www.walmart.com/cp/976759"]
    rules = (Rule(sle(restrict_xpaths=('//div[@class="lhn-menu-flyout-inner lhn-menu-flyout-2col"]/ul[@class="block-list"]/li/a',)),callback='parse',follow=True),)

    items_list_xpath = '//div[@class="js-tile tile-grid-unit"]'

    item_fields = {'title': './/a[@class="js-product-title"]/h3[@class="tile-heading"]/div',
                   'image_url': './/a[@class="js-product-image"]/img[@class="product-image"]/@src',
                   'price': './/div[@class="tile-price"]/div[@class="item-price-container"]/span[@class="price price-display"]|//div[@class="tile-price"]/div[@class="item-price-container"]/span[@class="price price-display price-not-available"]',
                   'category': '//nav[@id="breadcrumb-container"]/ol[@class="breadcrumb-list"]/li[@class="js-breadcrumb breadcrumb "][2]/a',
                   'subcategory': '//nav[@id="breadcrumb-container"]/ol[@class="breadcrumb-list"]/li[@class="js-breadcrumb breadcrumb active"]/a',
                   'url': './/a[@class="js-product-image"]/@href'}

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for item in selector.select(self.items_list_xpath):
            loader = XPathItemLoader(GroceryItem(), selector=item)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

You are not supposed to override the parse() method when using the CrawlSpider. You should set a custom callback in your Rule with a different name.
Here is the excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
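
A minimal sketch of that fix, keeping the rest of the spider above unchanged (the parse_items name is just an illustrative choice, any name other than parse works):

    rules = (Rule(sle(restrict_xpaths=('//div[@class="lhn-menu-flyout-inner lhn-menu-flyout-2col"]/ul[@class="block-list"]/li/a',)),
                  callback='parse_items',  # not 'parse'
                  follow=True),)

    def parse_items(self, response):
        # same body as the original parse() method
        selector = HtmlXPathSelector(response)
        for item in selector.select(self.items_list_xpath):
            loader = XPathItemLoader(GroceryItem(), selector=item)
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

With parse() left alone, CrawlSpider's built-in parse() can apply the rules to each response and follow the extracted links, invoking parse_items on the pages they lead to.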
