
Select parent of specific node using xpath/python

How do I get the href value of the <a> in this HTML snippet?

I need to select it based on the class of the <i> tag.

<a href="https://link.com" target="_blank"><i class="foobar"></i>  </a>           

I tried this, but I am getting no results:

foo_links = tree.xpath('//a[i/@class="foobar"]')

Your code does work for me: it returns a list of <a> elements. If you want a list of hrefs rather than the elements themselves, add /@href:

hrefs = tree.xpath('//a[i/@class="foobar"]/@href')

You could also find the <i>s first, then use /parent::* (or simply /..) to get back to the <a>s.

hrefs = tree.xpath('//a/i[@class="foobar"]/../@href')
#                     ^                    ^  ^
#                     |                    |  obtain the 'href'
#                     |                    |
#                     |                    get the parent of the <i>
#                     |
#                     find all <i class="foobar"> contained in an <a>.
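To check the expressions above outside the original page, here is a minimal, self-contained sketch. It assumes lxml is installed; the surrounding <div> and the second link (https://other.example) are made up for illustration and are not part of the question.

```python
import lxml.html

# Hypothetical test page: the snippet from the question plus one
# unrelated link that must NOT be matched.
html = (
    '<div>'
    '<a href="https://link.com" target="_blank"><i class="foobar"></i>  </a>'
    '<a href="https://other.example"><i class="baz"></i></a>'
    '</div>'
)
tree = lxml.html.fromstring(html)

# 1. Select the <a> elements that have an <i class="foobar"> child.
elements = tree.xpath('//a[i/@class="foobar"]')

# 2. Select the href attribute of those <a> elements directly.
hrefs_direct = tree.xpath('//a[i/@class="foobar"]/@href')

# 3. Select the <i> first, then step up to its parent with "..".
hrefs_via_parent = tree.xpath('//a/i[@class="foobar"]/../@href')

print([el.tag for el in elements])  # ['a']
print(hrefs_direct)                 # ['https://link.com']
print(hrefs_via_parent)             # ['https://link.com']
```

Both href queries should return the same single URL here. If they return an empty list on the real page, the problem is more likely in the document than in the XPath.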

If none of these work, check that the document actually has the structure you expect.

Note that XPath won't look inside comments (<!-- -->). If the <a> is actually inside a comment, you need to extract the comment's content and parse it as a separate document first.

import lxml.html

hrefs = [href for comment in tree.xpath('//comment()')
              # find all comments
              for href in lxml.html.fromstring(comment.text)
              # parse the content of each comment as a new HTML fragment
                              .xpath('//a[i/@class="foobar"]/@href')
                              # read the hrefs from it
        ]
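The comment-parsing approach can be tried on its own with the sketch below. The page content, the helper name hrefs_in_comments, and the decision to skip comments that are empty or fail to parse are all assumptions made for illustration, not part of the original answer.

```python
import lxml.html
from lxml import etree

# Hypothetical page in which the target link is commented out.
page = """
<div>
  <a href="https://visible.example"><i class="other"></i></a>
  <!-- <a href="https://link.com" target="_blank"><i class="foobar"></i>  </a> -->
  <!-- just a note -->
</div>
"""
tree = lxml.html.fromstring(page)


def hrefs_in_comments(tree, query='//a[i/@class="foobar"]/@href'):
    """Parse each comment's text as an HTML fragment and run `query` on it."""
    found = []
    for comment in tree.xpath('//comment()'):
        text = (comment.text or '').strip()
        if not text:
            continue  # an empty comment has nothing to parse
        try:
            fragment = lxml.html.fromstring(text)
        except etree.ParserError:
            continue  # skip comments lxml cannot parse as HTML
        found.extend(fragment.xpath(query))
    return found


print(tree.xpath('//a[i/@class="foobar"]/@href'))  # [] (the link is in a comment)
print(hrefs_in_comments(tree))                     # ['https://link.com']
```

The first query returns nothing because the commented-out <a> is not an element node in the parsed tree. The helper finds it only after re-parsing the comment text.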

Note that the target element is inside an HTML comment. You cannot get the <a> from a comment with an XPath such as "//a", because in this case it is not a node but a plain string.

Try the code below:

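import re

foo_links = tree.xpath('//comment()')  # get a list of all comments on the page
for link in foo_links:
    if '<i class="foobar">' in link.text:
        # raw string, with the dot escaped so it matches a literal "."
        match = re.search(r'\w+://\w+\.\w+', link.text)
        if match:  # re.search returns None when nothing matches
            href = match.group(0)  # href value from the required comment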

PS: You might need a more complex regular expression to match the link URL.
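As one possible "more complex" pattern, the sketch below captures whatever sits between the quotes of the href attribute instead of guessing the shape of the URL. The sample comment text (including its path and query string), the HREF_RE name, and the extract_href helper are hypothetical. A regular expression is still a rough tool for HTML, so treat this as a sketch rather than a robust parser.

```python
import re

# Hypothetical comment text, similar to what comment.text would return.
comment_text = (
    ' <a href="https://link.com/path?page=1" target="_blank">'
    '<i class="foobar"></i>  </a> '
)

# Capture everything between the quotes that follow href=.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']')


def extract_href(text):
    """Return the first href value found in `text`, or None if there is none."""
    match = HREF_RE.search(text)
    return match.group(1) if match else None


print(extract_href(comment_text))    # https://link.com/path?page=1
print(extract_href('no link here'))  # None

# For comparison, the simpler pattern stops after the domain:
print(re.search(r'\w+://\w+\.\w+', comment_text).group(0))  # https://link.com
```

The simpler pattern drops the path and query string, which is why a pattern anchored on the attribute may be a better fit when the links are more than bare domains.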

