I am crawling through a site using Scrapy and I want to format the extracted breadcrumbs to create a site path:
HTML:
<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>
What I am doing:
breadcrumb = response.xpath("//ul[@id='breadcrumbs']")[0].extract()
What I get right now:
<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>
What I really need:
/home/AboutUs/
Any idea how I should write the xpath or how I should format the results?
Get all the href values using the //ul[@id="breadcrumbs"]/li/a/@href XPath, extract the endings using .re(), and join them.
Example from the Scrapy shell:
$ scrapy shell index.html
>>> ''.join(response.xpath('//ul[@id="breadcrumbs"]/li/a/@href').re(r'^.*?(/\w+)$'))
u'/home/AboutUs'
^.*?(/\w+)$ matches any characters (the ? makes the match non-greedy), followed by a slash and one or more alphanumeric characters (or _). The parentheses capture the last part of the string (the slash and the alphanumeric characters). ^ and $ anchor the match to the beginning and the end of the string, respectively.
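To see what the regular expression captures without running a crawl, here is a minimal sketch that uses only Python's standard re module. The function name site_path is hypothetical, and the list of hrefs is copied by hand from the HTML in the question; the loop is meant to approximate what .re() followed by ''.join() does in the shell example, not to reproduce Scrapy's internals.

```python
import re

# Same pattern as in the answer: lazily skip leading characters,
# then capture the final "/segment" at the end of the string.
LAST_SEGMENT = re.compile(r'^.*?(/\w+)$')


def site_path(hrefs):
    """Join the last path segment of each href (hypothetical helper)."""
    parts = []
    for href in hrefs:
        match = LAST_SEGMENT.search(href)
        if match:  # hrefs that do not end in "/word" are skipped
            parts.append(match.group(1))
    return ''.join(parts)


# hrefs taken from the breadcrumb HTML in the question
hrefs = ["/site/ID/home", "/site/ID/AboutUs"]
print(site_path(hrefs))
```

If the trailing slash from the question is required, appending '/' to the joined result is one way to get it. Note that an href which itself ends in a slash would not match this pattern and would be left out.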