I have scraped a page with this HTML content:
<div class="td-ss-main-content">
<div class="td-page-header">...</div>
<div class="td_module_16 td_module_wrap td-animation-stack">...</div>
<div class="td_module_16 td_module_wrap td-animation-stack td_module_no_thumb">...</div>
<div class="page-nav td-pb-padding-side">
<span class="current">1</span>
<a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
<a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
<a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-right"></i></a>
<span class="pages">Page 1 of 3</span>
</div>
</div>
Now I would like to get the next page link, if it is present. It is the href value of the .page-nav > a element that has an i tag as a child.
I can do this:
response.css("div.page-nav > a")[2].css("::attr(href)").extract_first()
But this won't work if I am on page 2, because the position of the link changes. So it would be better to get the href of the a tag only if it has an i tag as a child element. How can I achieve that?
Update (page 2):
<div class="page-nav td-pb-padding-side">
<a href="http://www.arunachaltimes.in/2017/05/06/"><i class="td-icon-menu-left"></i></a>
<a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
<span class="current">2</span>
<a href="http://www.arunachaltimes.in/2017/05/06/page/3/" class="page" title="3">3</a>
<a href="http://www.arunachaltimes.in/2017/05/06/page/3/"><i class="td-icon-menu-right"></i></a>
<span class="pages">Page 2 of 3</span>
</div>
Update (page 3, the last page):
<div class="page-nav td-pb-padding-side">
<a href="http://www.arunachaltimes.in/2017/05/06/page/2/"><i class="td-icon-menu-left"></i></a>
<a href="http://www.arunachaltimes.in/2017/05/06/" class="page" title="1">1</a>
<a href="http://www.arunachaltimes.in/2017/05/06/page/2/" class="page" title="2">2</a>
<span class="current">3</span>
<span class="pages">Page 3 of 3</span>
</div>
You can achieve it with an XPath expression:
//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href
Note that, to avoid false positives, we use concat to pad the class attribute with spaces, so that only whole class names match rather than substrings of longer ones.
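To see why the space padding matters, here is a minimal plain-Python sketch that mirrors the two XPath checks as string operations. The helper names naive_contains and has_class_token are hypothetical and exist only for this illustration; the class values are taken from the HTML in the question.

```python
def naive_contains(class_attr, fragment):
    """Mirrors XPath contains(@class, 'fragment'): a plain substring test."""
    return fragment in class_attr


def has_class_token(class_attr, token):
    """Mirrors XPath contains(concat(' ', @class, ' '), ' token ').

    Padding both sides with spaces means only a whole class name can match.
    """
    return f" {token} " in f" {class_attr} "


nav_classes = "page-nav td-pb-padding-side"

# The substring test wrongly reports a "page" class on the nav div and on
# the <span class="pages"> element.
print(naive_contains(nav_classes, "page"))       # True (false positive)
print(naive_contains("pages", "page"))           # True (false positive)

# The padded test only matches complete class names.
print(has_class_token(nav_classes, "page"))      # False
print(has_class_token("pages", "page"))          # False
print(has_class_token(nav_classes, "page-nav"))  # True
```

One caveat: this idiom assumes the class names are separated by single spaces. If the markup might use tabs or newlines inside the class attribute, wrapping @class in XPath's normalize-space() is a common addition.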
Demo:
$ scrapy shell file:////$PWD/index.html
In [1]: response.xpath("//div[contains(concat(' ', @class, ' '), ' page-nav ')]/a[contains(concat(' ', i/@class, ' '), ' td-icon-menu-right ')]/@href").extract_first()
Out[1]: u'http://www.arunachaltimes.in/2017/05/06/page/2/'