How to exclude a particular HTML tag (without any id) from several tags while using Scrapy?
<div class="region size2of3">
<h2>Mumbai</h2>
<strong>Fort</strong>
<div>Elphinstone building, Horniman Circle,</div>
<div>Veer Nariman Road, Fort</div>
<div>Mumbai 400001</div>
<div>Timings: 08:00-00:30 hrs (Mon-Sun)</div>
<div><br></div>
</div>
I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing.
Here's my code:
import scrapy
from job.items import StarbucksItem


class StarbucksSpider(scrapy.Spider):
    name = "starbucks"
    allowed_domains = ["starbucks.in"]
    start_urls = ["http://www.starbucks.in/coffeehouse/store-locations/"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="region size2of3"]'):
            item = StarbucksItem()
            item['title'] = sel.xpath('div/text()').extract()
            yield item
I would use the starts-with() XPath function to get the text of the div element that starts with "Timings":
sel.xpath('.//div[starts-with(., "Timings")]/text()').extract()
Note that the HTML structure of the page doesn't make it easy to distinguish locations from one another: there are no location-specific containers that you can iterate over. In this case, I would find every h2 or strong tag and use following-sibling. An example from the Scrapy shell:
In [10]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()').extract()[0]
   ....:     print name, timings
   ....:
Mumbai Timings: 08:00-00:30 hrs (Mon-Sun)
Fort Timings: 08:00-00:30 hrs (Mon-Sun)
Colaba Timings: 07:00-01:00 hrs (Mon-Sun)
Goregaon Timings: 10:00-23:30 hrs (Mon-Sun)
Powai Timings: 07:00-00:00 hrs (Mon-Sun)
...
Hi-Tech City Timings: 09:00 - 22:30 hrs (Mon - Sun)
Madhapur Timings: 11:00 -23:00 hrs (Mon - Sun)
Banjara Hills Timings: 10:00 -22:30 hrs (Mon - Sun)
Also note that if you want to extract the time range values, you can use .re():
In [18]: for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
   ....:     name = sel.xpath('text()').extract()[0]
   ....:     timings = sel.xpath('./following-sibling::div[starts-with(., "Timings")]/text()')[0].re(r'(\d+:\d+)\s*\-\s*(\d+:\d+)')[:2]
   ....:     print name, timings
   ....:
Mumbai [u'08:00', u'00:30']
Fort [u'08:00', u'00:30']
Colaba [u'07:00', u'01:00']
Goregaon [u'10:00', u'23:30']
...
Hi-Tech City [u'09:00', u'22:30']
Madhapur [u'11:00', u'23:00']
Banjara Hills [u'10:00', u'22:30']
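The pattern passed to .re() can be checked on its own with the standard re module. The sketch below reuses the same regular expression on three of the timing strings shown in the output above, including the ones with uneven spacing around the hyphen. The names TIME_RANGE and parse_range are made up for this example.

```python
import re

# Same pattern as in the .re() call: two HH:MM groups around a hyphen,
# with optional whitespace on either side of it.
TIME_RANGE = re.compile(r'(\d+:\d+)\s*\-\s*(\d+:\d+)')

samples = [
    "Timings: 08:00-00:30 hrs (Mon-Sun)",
    "Timings: 09:00 - 22:30 hrs (Mon - Sun)",
    "Timings: 11:00 -23:00 hrs (Mon - Sun)",
]


def parse_range(text):
    """Return (opening, closing) as strings, or None if no range is found."""
    match = TIME_RANGE.search(text)
    return match.groups() if match else None


for text in samples:
    print(parse_range(text))
```

Returning None for a non-matching string avoids the IndexError that indexing an empty result with [0] would raise when a location has no timings line.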
Additionally, make sure the yield statement is inside the loop body in the code you've posted; if it sits after the loop, only the last item is yielded.
If you want to exclude Timings and get the rest of the location description, use:
for sel in response.xpath('//div[contains(@class, "region")]/*[self::h2 or self::strong]'):
    print(" ".join(item.strip() for item in sel.xpath('following-sibling::div[position() < 4 and not(starts-with(., "Timings"))]/text()').extract()))