Scrapy format breadcrumbs using xpath

Question

I am crawling through a site using Scrapy and I want to format the extracted breadcrumbs to create a site path:

HTML:

<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>

What I am doing:

breadcrumb = response.xpath("//ul[@id='breadcrumbs']")[0].extract()

What I get right now:

<ul id="breadcrumbs"><li><a href="/site/ID/home">Home</a></li> <li><a href="/site/ID/AboutUs">Who We Are</a></li></ul>

What I really need:

/home/AboutUs/

Any idea how I should write the xpath or how I should format the results?

Answer 1

Get all the href values using //ul[@id="breadcrumbs"]/li/a/@href xpath, extract the endings using .re() and join them.

Example from the Scrapy shell:

$ scrapy shell index.html 
>>> ''.join(response.xpath('//ul[@id="breadcrumbs"]/li/a/@href').re(r'^.*?(/\w+)$'))
u'/home/AboutUs'

The pattern ^.*?(/\w+)$ matches any characters (the ? makes the match non-greedy), followed by a slash and one or more alphanumeric characters (or _). The parentheses capture that last part of the string (the slash and the alphanumeric characters). ^ and $ anchor the match to the beginning and the end of the string, respectively.
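
To see what the regex does to each href without running a crawl, here is a minimal plain-Python sketch. It assumes the href values have already been extracted (the list below is copied from the HTML in the question), and build_site_path is a hypothetical helper name used only for illustration, not part of Scrapy. It approximates what .re() followed by ''.join() does in the shell example.

```python
import re

# href values as the XPath //ul[@id="breadcrumbs"]/li/a/@href would yield
# them for the HTML in the question (hard-coded here for illustration).
hrefs = ["/site/ID/home", "/site/ID/AboutUs"]

# Same pattern as in the answer: capture the last "/segment" of each href.
LAST_SEGMENT = re.compile(r'^.*?(/\w+)$')


def build_site_path(hrefs):
    """Join the last path segment of every href (hypothetical helper)."""
    parts = []
    for href in hrefs:
        match = LAST_SEGMENT.search(href)
        if match:  # hrefs that do not end in "/word" are skipped
            parts.append(match.group(1))
    return ''.join(parts)


site_path = build_site_path(hrefs)
print(site_path)
```

For the two hrefs above this produces /home/AboutUs. If the trailing slash from the question is required, append it explicitly, for example site_path + '/'. Note that \w does not match a hyphen, so an href such as /site/ID/about-us would be skipped by this pattern.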
