How do I get the href value of the <a> in this snippet of HTML?
I need to select it based on the class of the <i> tag inside it.
<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i> </a>
-->
I tried this, but I am getting no results:
foo_links = tree.xpath('//a[i/@class="foobar"]')
Your code does work for me: it returns a list of <a> elements. If you want a list of the href values rather than the elements themselves, add /@href:
hrefs = tree.xpath('//a[i/@class="foobar"]/@href')
You could also first find the <i>s, then use /parent::* (or simply /..) to get back to the <a>s:
hrefs = tree.xpath('//a/i[@class="foobar"]/../@href')
#                       ^                  ^  ^
#                       |                  |  obtain the 'href'
#                       |                  |
#                       |                  get the parent of the <i>
#                       |
#                       find all <i class="foobar"> contained in an <a>.
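To check that both expressions select the same attribute, here is a minimal sketch that runs them against the snippet from the question with the comment markers removed. The wrapping <div> and the variable names are assumptions added for illustration.

```python
import lxml.html

# The snippet from the question, without the surrounding <!-- --> markers.
# The wrapping <div> is only there to give the fragment a container.
html = '<div><a href="https://link.com" target="_blank"><i class="foobar"></i> </a></div>'
tree = lxml.html.fromstring(html)

# Select the <a> that has an <i class="foobar"> child, then read its href.
via_child = tree.xpath('//a[i/@class="foobar"]/@href')

# Select the <i class="foobar"> first, step up to its parent, then read the href.
via_parent = tree.xpath('//a/i[@class="foobar"]/../@href')

print(via_child)
print(via_parent)
```

If both lists come back empty on the real page, the <a> is probably not part of the parsed tree, for example because it sits inside a comment.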
If none of these work, verify that the structure of the document is what you expect.
Note that XPath won't peek inside comments <!-- -->. If the <a> is indeed inside a comment <!-- -->, you need to extract the comment's content and parse it yourself first.
import lxml.html

hrefs = [
    href
    # find all comments
    for comment in tree.xpath('//comment()')
    # parse the content of each comment as a new HTML document,
    # then read the hrefs from it
    for href in lxml.html.fromstring(comment.text)
                         .xpath('//a[i/@class="foobar"]/@href')
]
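The same idea can be written as a small function, which makes it easier to skip comments that contain nothing to parse. This is only a sketch: the page string and the helper name hrefs_in_comments are hypothetical, and the class name is hard-coded to match the question.

```python
import lxml.html

# Hypothetical page: the link from the question sits inside a comment.
page = '''<div>
<p>visible text</p>
<!-- -->
<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i> </a>
-->
</div>'''

def hrefs_in_comments(tree):
    """Return hrefs of <a> elements wrapping <i class="foobar"> found inside comments."""
    found = []
    for comment in tree.xpath('//comment()'):
        text = comment.text or ''
        if not text.strip():
            continue  # skip empty comments; there is nothing to parse
        # Parse the comment body as its own HTML fragment and query it.
        fragment = lxml.html.fromstring(text)
        found.extend(fragment.xpath('//a[i/@class="foobar"]/@href'))
    return found

tree = lxml.html.fromstring(page)
hrefs = hrefs_in_comments(tree)
print(hrefs)
```

Skipping whitespace-only comments before parsing avoids handing an empty document to the parser.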
Note that the target element is inside an HTML comment. You cannot simply get the <a> from a comment with an XPath like "//a", because in this case it is not a node but a plain string.
Try the code below:
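import re

foo_links = tree.xpath('//comment()')  # get a list of all comments on the page
href = None  # stays None if no comment contains the target <i>
for link in foo_links:
    if '<i class="foobar">' in link.text:
        # get the href value from the required comment
        # (raw string, and the dot is escaped so it matches a literal ".")
        href = re.search(r'\w+://\w+\.\w+', link.text).group(0)
        break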
P.S. You might need a more complex regular expression to match the link URL.
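As one possible "more complex" pattern, the sketch below captures the whole href attribute value instead of guessing at the URL's shape. It assumes double-quoted attributes and that the <i class="foobar"> directly follows the opening <a> tag. The comment text, including the path and query string, is a hypothetical example. Regular expressions remain fragile on HTML in general.

```python
import re

# Hypothetical comment text, shaped like `comment.text` for the snippet in the
# question but with a longer URL to show that the full value is captured.
comment_text = '\n<a href="https://link.com/path?x=1" target="_blank"><i class="foobar"></i> </a>\n'

# Capture the href of an <a> whose opening tag is immediately followed by
# <i class="foobar">. Assumes double-quoted attribute values.
pattern = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>\s*<i\s+class="foobar"')

match = pattern.search(comment_text)
href = match.group(1) if match else None  # None when the pattern is absent
print(href)
```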