Xpath locates html element correctly in console but returns empty array when used in scrapy response

I created a web scraper using the Scrapy framework to get concert ticket data from this website. I have been able to successfully scrape data for a few selectors, which are essentially just HTML text, but a few other selectors are not collecting anything. When I try to scrape the concert date from each ticket, an empty array is returned in the response, even though the XPath I use returns all of the correct dates when it is run in the developer console. Is there something wrong with the way I define the item in the class definition? Any help would be greatly appreciated:

from scrapy.contrib.spiders import CrawlSpider 
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from concert_comparator.items import ComparatorItem

bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    #rules = (Rule(LinkExtractor(allow=('-tickets/.*', )), callback='parse_item'))
    # item = ComparatorItem()
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
    item_fields = {
        'eventName' : './/*[@class="productionsEvent"]/text()',
        #'ticketPrice' : '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price',
        'eventLocation' : './/*[@class = "productionsVenue"]/span[@itemprop  = "name"]/text()',
        'ticketsLink' : './/a/@href',
        #returns empty set
        'eventDate' : './/*[@class = "productionsDateCol productionsDateCol sorting_3"]/div[@class = "productionsDate"]/text()',
        'eventCity' : './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressLocality"]/text()',
        'eventState' : './/*[@class = "productionsVenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressRegion"]/text()',
        #returns empty set
        'eventTime' : './/*[@class = "productionsDateCol productionsDateCol sorting_3"]/div[@class = "productionsTime"]/text()'
    }
    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()

Not exactly sure why, but after some trial and error I found the correct XPaths to use. By simply using the class attribute of the tag I was trying to extract text from, I was able to scrape the elements for all of the tickets on the page.
E.g. eventDate: './/*[@class = "productionsDate"]/text()'
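
One plausible explanation, which the answer above does not confirm, is that `@class = "..."` in XPath is an exact string comparison against the whole attribute value. If a script adds a class such as `sorting_3` after the page loads, the browser console sees it but the raw HTML that Scrapy downloads does not, so the longer XPath matches nothing. The sketch below illustrates this with Python's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selector. The markup, the date value, and the assumption that `sorting_3` is added by JavaScript are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, as the server might send it before any JavaScript runs.
SERVED_HTML = """
<table>
  <tr itemtype="http://schema.org/Event">
    <td class="productionsDateCol">
      <div class="productionsDate">Jun 12</div>
      <div class="productionsTime">8:00 PM</div>
    </td>
  </tr>
</table>
""".strip()

# The same markup after a script has (hypothetically) appended a sorting class.
# This is the version a browser console would query.
RENDERED_HTML = SERVED_HTML.replace(
    'class="productionsDateCol"', 'class="productionsDateCol sorting_3"'
)

# Depends on the full class string of the parent cell.
LONG_XPATH = './/*[@class="productionsDateCol sorting_3"]/div[@class="productionsDate"]'
# Depends only on the class of the element that holds the text.
SHORT_XPATH = './/*[@class="productionsDate"]'


def extract_dates(html, xpath):
    """Return the stripped text of every element matched by xpath."""
    root = ET.fromstring(html)
    return [(el.text or "").strip() for el in root.findall(xpath)]


served_long = extract_dates(SERVED_HTML, LONG_XPATH)      # parent class differs: no match
rendered_long = extract_dates(RENDERED_HTML, LONG_XPATH)  # matches in the "browser" version
served_short = extract_dates(SERVED_HTML, SHORT_XPATH)    # matches in the raw HTML

print(served_long)
print(rendered_long)
print(served_short)
```

A quick way to check this on the real page would be to open it with `scrapy shell <url>` and run the XPath against `response` there, since that shows the HTML Scrapy actually receives rather than the DOM the browser has built.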
