
How to use scrapy.Request to load an element from another page into an item

I've created a web scraper using Scrapy that can scrape elements from each ticket on this website, but it cannot scrape the ticket price since the price isn't available on that page. When I try to request the next page to scrape the price, I get the error: exceptions.TypeError: 'XPathItemLoader' object has no attribute '__getitem__'. So far I have only been able to scrape elements using item loaders, so that's what I am currently using, and I'm not sure of the correct procedure for passing elements scraped on another page to the item loader (I have seen one way to do it with the item data type, but it didn't apply here). I think I may have had problems extracting elements into an item object because I am pipelining into a database, but I'm not sure. If the code I post below could be modified to properly crawl to the next page, scrape the price, and add it to the item loader, I think the problem would be solved. Any help will be appreciated. Thanks!

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
    def parse_price(self, response):
        #First attempt at trying to load price into item loader
        loader.add_xpath('ticketPrice' , '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price')
        print 'ticket price'
    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader

            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop  = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader["ticketsLink"]
            request = scrapy.Request(ticketsURL , callback = self.parse_price)
            yield loader.load_item()

Key things to fix:

  • to get a value from an item loader, use get_output_value(); replace:

     loader["ticketsLink"] 

    with:

     loader.get_output_value("ticketsLink") 

  • pass the loader in the meta of the request, and yield/return the loaded item from the callback

  • when constructing the URL for the price page, use urljoin() to join the relative part with the current URL

Here is the fixed version:

from urlparse import urljoin
# other imports

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
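    def parse_price(self, response):
        loader = response.meta['loader']
        # The loader is still bound to the ticket selector from the listing page,
        # so run the XPath against this response and add the result as a value.
        price = HtmlXPathSelector(response).select('//*[@class="eventTickets lastChild"]/div/div/@data-origin-price').extract()
        loader.add_value('ticketPrice', price)
        return loader.load_item()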

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader

            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop  = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price)
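
To make the urljoin() step above more concrete, here is a small standalone sketch of how relative and absolute links are resolved against the current page URL. It uses the Python 3 import location (the answer's code is Python 2, where the function lives in urlparse), and the page URL and link paths are made-up placeholders, not real pages on the site.

```python
# Python 3 location; Python 2 used `from urlparse import urljoin`.
from urllib.parse import urljoin

# Hypothetical listing-page URL standing in for response.url.
page_url = "http://www.vividseats.com/concerts/some-band-tickets.html"

# A relative link is resolved against the directory of the current page.
relative_to_dir = urljoin(page_url, "some-band-tickets/some-band-123.html")

# A root-relative link (leading slash) replaces the whole path.
root_relative = urljoin(page_url, "/concerts/some-band-tickets/some-band-123.html")

# An already absolute link is returned unchanged.
absolute = urljoin(page_url, "http://www.vividseats.com/other.html")

print(relative_to_dir)
print(root_relative)
print(absolute)
```

In this sketch the first two calls produce the same absolute URL, which is why joining against response.url is usually safer than concatenating strings by hand.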

I had exactly the same problem and solved it in another post. I'm putting my code here to share (my original post is here):

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem 
from CAPjobs.items import CAPjobsItemLoader

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
"http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        location = response.xpath('//div[@id="extranav"]//ul[@class="job-addresses"]/li/text()').extract()
        il.add_value('loc_pj', location)  
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')    

        for site in sites:

            il = CAPjobsItemLoader(CAPjobsItem(), selector = site) 
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)
