<div class="region size2of3">
<h2>Mumbai</h2>
<strong>Fort</strong>
<div>Elphinstone building, Horniman Circle,</div>
<div>Veer Nariman Road, Fort</div>
<div>Mumbai 400001</div>
<div>Timings: 08:00-00:30 hrs (Mon-Sun)</div>
<div><br></div>
</div>
I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing.
Here's my code:
import scrapy
from job.items import StarbucksItem

class StarbucksSpider(scrapy.Spider):
    name = "starbucks"
    allowed_domains = ["starbucks.in"]
    start_urls = ["http://www.starbucks.in/coffeehouse/store-locations/"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="region size2of3"]'):
            item = StarbucksItem()
            item['title'] = sel.xpath('div/text()').extract()
        yield item
I would use the starts-with() XPath function to get the text of the div element that starts with "Timings":
sel.xpath('.//div[starts-with(., "Timings")]/text()').extract()
Note that the HTML structure of the page doesn't make it easy to distinguish one location from another: there are no location-specific containers that you can iterate over. In this case, I would find every h2 or strong tag and use the following-sibling axis. An example from the Scrapy shell:
In [10]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()').extract()[0]
   ....:     print name, timings
   ....:
Mumbai Timings: 08:00-00:30 hrs (Mon-Sun)
Fort Timings: 08:00-00:30 hrs (Mon-Sun)
Colaba Timings: 07:00-01:00 hrs (Mon-Sun)
Goregaon Timings: 10:00-23:30 hrs (Mon-Sun)
Powai Timings: 07:00-00:00 hrs (Mon-Sun)
...
Hi-Tech City Timings: 09:00 - 22:30 hrs (Mon - Sun)
Madhapur Timings: 11:00 -23:00 hrs (Mon - Sun)
Banjara Hills Timings: 10:00 -22:30 hrs (Mon - Sun)
Also note that if you want to extract the time range values, you can use .re():
In [18]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()')[0].re(r'(\d+:\d+)\s*\-\s*(\d+:\d+)')[:2]
   ....:     print name, timings
Mumbai [u'08:00', u'00:30']
Fort [u'08:00', u'00:30']
Colaba [u'07:00', u'01:00']
Goregaon [u'10:00', u'23:30']
...
Hi-Tech City [u'09:00', u'22:30']
Madhapur [u'11:00', u'23:00']
Banjara Hills [u'10:00', u'22:30']
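The regular expression passed to .re() can be tried on its own, without Scrapy. The following is a minimal sketch using only the standard re module; the helper name extract_time_range is made up for this illustration, and the sample strings are taken from the shell output shown earlier.

```python
import re

# Same pattern as in the .re() call: two HH:MM values separated by a dash,
# with optional whitespace on either side of the dash.
TIME_RANGE = re.compile(r'(\d+:\d+)\s*-\s*(\d+:\d+)')


def extract_time_range(text):
    """Return (opening, closing) as strings, or None if no range is found.

    Hypothetical helper, not part of Scrapy.
    """
    match = TIME_RANGE.search(text)
    return match.groups() if match else None


samples = [
    "Timings: 08:00-00:30 hrs (Mon-Sun)",
    "Timings: 09:00 - 22:30 hrs (Mon - Sun)",
    "Timings: 11:00 -23:00 hrs (Mon - Sun)",
]
for sample in samples:
    print(extract_time_range(sample))
```

The \s* on both sides of the dash is what lets one pattern cover the inconsistent spacing in the page's "Timings" lines.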
Additionally, make sure the yield is inside the loop body (see the code you've posted).
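Why the placement of yield matters can be seen in plain Python, with no Scrapy involved. In the sketch below, the two function names and the dict items are illustrative stand-ins for a spider's parse() method and its items.

```python
def parse_yield_outside(rows):
    """yield placed after the loop: it runs once, when the loop is done."""
    item = None
    for row in rows:
        item = {"title": row}
    # Only the item built in the last iteration is emitted.
    yield item


def parse_yield_inside(rows):
    """yield placed in the loop body: it runs once per iteration."""
    for row in rows:
        item = {"title": row}
        yield item


rows = ["Fort", "Colaba", "Powai"]
print(list(parse_yield_outside(rows)))
print(list(parse_yield_inside(rows)))
```

With the yield outside the loop, every item except the last one is silently discarded, which in a spider shows up as far fewer scraped items than expected.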
If you want to exclude Timings and get the rest of the location description, use:
for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
    print " ".join(item.strip() for item in sel.xpath('following-sibling::div[position() < 4 and not(starts-with(., "Timings"))]/text()').extract())
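The same filtering idea can be checked offline against the fragment from the question. This is only a rough sketch: it swaps the XPath expression for a Python-side startswith() test and uses the standard library's xml.etree.ElementTree instead of Scrapy selectors. It assumes a region with a single location, as in the sample, and writes the `<br>` as `<br/>` so that the XML parser accepts it. The function name address_lines is made up for the example.

```python
import xml.etree.ElementTree as ET

# Fragment from the question; <br> is written as <br/> here only so that
# the stdlib XML parser accepts it.
FRAGMENT = """
<div class="region size2of3">
<h2>Mumbai</h2>
<strong>Fort</strong>
<div>Elphinstone building, Horniman Circle,</div>
<div>Veer Nariman Road, Fort</div>
<div>Mumbai 400001</div>
<div>Timings: 08:00-00:30 hrs (Mon-Sun)</div>
<div><br/></div>
</div>
"""


def address_lines(region):
    """Text of the direct child divs, skipping empty ones and the Timings line.

    Hypothetical helper for this sketch, not a Scrapy API.
    """
    lines = []
    for div in region.findall("div"):
        text = (div.text or "").strip()
        if text and not text.startswith("Timings"):
            lines.append(text)
    return lines


region = ET.fromstring(FRAGMENT.strip())
print(" ".join(address_lines(region)))
```

On the real page a region holds several locations, so the per-heading following-sibling approach shown above is still needed there; this sketch only demonstrates the exclusion step.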