
XPath in Python: find nodes after specific text

Here is the HTML code:

<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 1</a>

    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link4 inside specific text 2</a>

    <h2>Specific text 3</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 3</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 3</a>         

</div>  

I need to find the links under each "Specific text" heading separately. The problem is that if I write the following code in Python:

links = root.xpath("//div[@id='someid']//a")
for link in links:
    print(link.attrib['href'])

It prints ALL the links, regardless of which "Specific text x" they fall under, whereas I want something like:

print("link under Specific text:" + specific + " link:" + link.attrib['href'])

Please suggest how to do this.

I think you will need one XPath expression for each h2 specific text.

Given the text of one h2, you can select the a siblings that directly follow it with:

    //div[@id='someid']/h2[.='Specific text 1']
     /following-sibling::a[
      count( . | following-sibling::h2[1]/preceding-sibling::*)
      = count(following-sibling::h2[1]/preceding-sibling::*)
      and preceding-sibling::h2[1][.='Specific text 1']]
    |
    //div[@id='someid']/h2[.='Specific text 1' and not(following-sibling::h2[1])]
    /following-sibling::a

The second //h2 selection handles the case where h2 is the last one.
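As a minimal sketch of how the expression above could be run once per heading, assuming lxml is the parser behind root.xpath (the question does not say so explicitly). The heading text is passed in as an XPath variable, $h, instead of being hard-coded. The helper name links_under and the distinct href values are made up for illustration; the original markup uses href="link" everywhere.

```python
from lxml import html

# Sample markup modelled on the question. The distinct href values are
# an assumption added here so the groups are easy to tell apart.
PAGE = """
<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="t1-link1">link1</a>
    <a class="hyperlinks" href="t1-link2">link2</a>
    <a class="hyperlinks" href="t1-link3">link3</a>
    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="t2-link1">link1</a>
    <a class="hyperlinks" href="t2-link2">link2</a>
    <h2>Specific text 3</h2>
    <a class="hyperlinks" href="t3-link1">link1</a>
    <a class="hyperlinks" href="t3-link2">link2</a>
</div>
"""

# The expression from the answer, with the literal heading replaced by $h.
EXPR = (
    "//div[@id='someid']/h2[.=$h]"
    "/following-sibling::a["
    "count(. | following-sibling::h2[1]/preceding-sibling::*)"
    " = count(following-sibling::h2[1]/preceding-sibling::*)"
    " and preceding-sibling::h2[1][.=$h]]"
    " | "
    "//div[@id='someid']/h2[.=$h and not(following-sibling::h2[1])]"
    "/following-sibling::a"
)


def links_under(root, heading):
    """Return the href of every link that belongs to the given heading."""
    return [a.get("href") for a in root.xpath(EXPR, h=heading)]


root = html.fromstring(PAGE)
for heading in root.xpath("//div[@id='someid']/h2/text()"):
    for href in links_under(root, str(heading)):
        print("link under " + heading + " link: " + href)
```

Passing the heading as a variable avoids quoting problems if a heading contains an apostrophe.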

The expression above just exploits the XPath 1.0 intersection formula:

$ns1[count(.|$ns2)=count($ns2)]
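To see the formula in isolation, here is a small sketch on a toy document (the element names r, a, b, c, d, e are invented for the example, and lxml is assumed). The two node-sets are written out as plain paths and substituted for $ns1 and $ns2: a node of $ns1 is kept only if adding it to $ns2 does not change the size of $ns2, which means it was already in $ns2.

```python
from lxml import etree

doc = etree.fromstring("<r><a/><b/><c/><d/><e/></r>")

# $ns1: every element after <a>; $ns2: every element before <e>.
ns1 = "/r/a/following-sibling::*"
ns2 = "/r/e/preceding-sibling::*"

# Substitute both node-sets into $ns1[count(.|$ns2) = count($ns2)].
expr = f"{ns1}[count(. | {ns2}) = count({ns2})]"

# The intersection is whatever lies after <a> and before <e>.
between = [el.tag for el in doc.xpath(expr)]
print(between)  # ['b', 'c', 'd']
```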

You can find plenty of resources on this method, including many answers here on SO (some of mine too). Applying the formula is not difficult; the hard part is recognizing when it is needed.

Credit for the formula goes to Michael Kay; it is often called the Kayessian method, so it is easy to look up.

My expression extends the formula with additional predicates to handle your specific case, and is combined by union ( | ) with a second expression that handles the last h2.

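You could use the starts-with(s, t) function to build a matching condition on the h2 value. Because the a elements are siblings of the h2, not its descendants, the match has to be followed by the following-sibling axis:

//div/h2[starts-with(text(), 'Specific text')]/following-sibling::a

starts-with is part of XPath 1.0, so this runs in lxml. Note that it returns the links after every matching h2 together, so you still have to evaluate it per heading, or change the condition for your needs, to keep the groups apart.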
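One possible way to adapt this, as a sketch only and assuming lxml: starts-with is also available in lxml's XPath 1.0 engine, so it can pick out the headings, and a plain Python loop over each heading's following siblings can then collect links until the next h2. The function name group_links and the distinct href values are invented for the example.

```python
from lxml import html

# Sample markup modelled on the question. The distinct href values are
# an assumption added here so the groups are easy to tell apart.
PAGE = """
<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="t1-link1">link1</a>
    <a class="hyperlinks" href="t1-link2">link2</a>
    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="t2-link1">link1</a>
    <h2>Specific text 3</h2>
    <a class="hyperlinks" href="t3-link1">link1</a>
    <a class="hyperlinks" href="t3-link2">link2</a>
</div>
"""


def group_links(root, prefix="Specific text"):
    """Map each heading that starts with `prefix` to the hrefs below it."""
    grouped = {}
    headings = root.xpath(
        "//div[@id='someid']/h2[starts-with(normalize-space(.), $p)]",
        p=prefix,
    )
    for h2 in headings:
        title = h2.text_content().strip()
        hrefs = []
        # Walk the following siblings and stop at the next heading.
        for sibling in h2.itersiblings():
            if sibling.tag == "h2":
                break
            if sibling.tag == "a":
                hrefs.append(sibling.get("href"))
        grouped[title] = hrefs
    return grouped


root = html.fromstring(PAGE)
for title, hrefs in group_links(root).items():
    for href in hrefs:
        print("link under " + title + " link: " + href)
```

This trades a single large XPath expression for one short query plus a loop, which may be easier to read and maintain.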


 