
Scrapy format breadcrumbs using xpath

I am crawling through a site using Scrapy and I want to format the extracted breadcrumbs to create a site path:

HTML:

<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>

What I am doing:

breadcrumb = response.xpath("//ul[@id='breadcrumbs']")[0].extract()

What I get right now:

<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>

What I really need:

/home/AboutUs/

Any idea how I should write the xpath or how I should format the results?

Get all the href values with the XPath expression //ul[@id="breadcrumbs"]/li/a/@href, extract the last path segment of each one using .re(), and join the results.

Example from the scrapy shell:

$ scrapy shell index.html 
>>> ''.join(response.xpath('//ul[@id="breadcrumbs"]/li/a/@href').re(r'^.*?(/\w+)$'))
u'/home/AboutUs'

The pattern ^.*?(/\w+)$ matches any characters (the ? makes the match non-greedy), followed by a slash and one or more word characters (letters, digits, and _). The parentheses capture that last part of the string (the slash and the word characters). ^ and $ anchor the match to the beginning and the end of the string, respectively.
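
To see the pattern at work outside of Scrapy, here is a minimal sketch that uses only Python's standard re module on the two href values from the question. The helper name build_site_path is hypothetical (it is not part of Scrapy), and appending the trailing slash is an assumption based on the desired output /home/AboutUs/ stated in the question.

```python
import re

# Same pattern as in the shell example: lazily skip the prefix,
# then capture the final "/segment" at the end of the string.
LAST_SEGMENT = re.compile(r'^.*?(/\w+)$')


def build_site_path(hrefs):
    """Join the last path segment of each href and append a trailing slash.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    segments = []
    for href in hrefs:
        match = LAST_SEGMENT.match(href)
        if match:  # hrefs that do not end in "/word" are skipped
            segments.append(match.group(1))
    return ''.join(segments) + '/'


# The href values from the HTML in the question.
hrefs = ['/site/ID/home', '/site/ID/AboutUs']
print(build_site_path(hrefs))  # /home/AboutUs/
```

In a spider callback, the list passed in would presumably be the extracted href values, for example response.xpath('//ul[@id="breadcrumbs"]/li/a/@href').extract().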


 