
How to exclude a particular HTML tag (without any id) from several tags while using Scrapy?

<div class="region size2of3">
<h2>Mumbai</h2>
<strong>Fort</strong>
<div>Elphinstone building, Horniman Circle,</div>
<div>Veer Nariman Road, Fort</div>
<div>Mumbai 400001</div>
<div>Timings: 08:00-00:30 hrs (Mon-Sun)</div>
<div><br></div>
</div>

I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing.

Here's my code:

import scrapy
from job.items import StarbucksItem

class StarbucksSpider(scrapy.Spider):
    name = "starbucks"
    allowed_domains = ["starbucks.in"]
    start_urls = ["http://www.starbucks.in/coffeehouse/store-locations/"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="region size2of3"]'):
            item = StarbucksItem()
            item['title'] = sel.xpath('div/text()').extract()
        yield item

I would use the starts-with() XPath function to get the text of the div element that starts with "Timings":

sel.xpath('.//div[starts-with(., "Timings")]/text()').extract()

Note that the HTML structure of the page doesn't make it easy to distinguish locations from one another: there are no location-specific containers that you can iterate over. In this case, I would find every h2 or strong tag and use following-sibling. Here is an example from the Scrapy Shell:

In [10]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()').extract()[0]
   ....:     print name, timings
   ....:
Mumbai Timings: 08:00-00:30 hrs (Mon-Sun)
Fort Timings: 08:00-00:30 hrs (Mon-Sun)
Colaba Timings: 07:00-01:00 hrs (Mon-Sun)
Goregaon Timings: 10:00-23:30 hrs (Mon-Sun)
Powai Timings: 07:00-00:00 hrs (Mon-Sun)
...
Hi-Tech City Timings: 09:00 - 22:30 hrs (Mon - Sun)
Madhapur Timings: 11:00 -23:00 hrs (Mon - Sun)
Banjara Hills Timings: 10:00 -22:30 hrs (Mon - Sun)

Also note that if you want to extract the time range values, you can use .re():

In [18]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()')[0].re(r'(\d+:\d+)\s*\-\s*(\d+:\d+)')[:2]
   ....:     print name, timings
   ....:
Mumbai [u'08:00', u'00:30']
Fort [u'08:00', u'00:30']
Colaba [u'07:00', u'01:00']
Goregaon [u'10:00', u'23:30']
...
Hi-Tech City [u'09:00', u'22:30']
Madhapur [u'11:00', u'23:00']
Banjara Hills [u'10:00', u'22:30']

Additionally, make sure the yield is inside the loop body (see the code you've posted).
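The effect of the yield placement can be seen without Scrapy. In this sketch, plain generators iterate over a list of made-up region names that stand in for the selectors; the function names are hypothetical.

```python
# Stand-ins for the selectors returned by response.xpath(...).
REGIONS = ["Mumbai", "Hyderabad", "Bangalore"]


def parse_yield_outside(regions):
    """Mirrors the posted code: the yield sits after the loop."""
    for region in regions:
        item = {"title": region}
    # Runs once, after the loop has finished, so only the last item survives.
    yield item


def parse_yield_inside(regions):
    """Corrected placement: one item is yielded per iteration."""
    for region in regions:
        item = {"title": region}
        yield item


print(list(parse_yield_outside(REGIONS)))
print(list(parse_yield_inside(REGIONS)))
```

In the spider, the equivalent change is to indent `yield item` one level so that it sits inside the `for sel in ...` loop of `parse`.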


If you want to exclude Timings and get the rest of the location description, use:

for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
    print " ".join(item.strip() for item in sel.xpath('following-sibling::div[position() < 4 and not(starts-with(., "Timings"))]/text()').extract())
