
Using Python lxml.html how can I find images within link tags?

I am using lxml.html to parse some HTML and extract links. However, when it hits a link that contains an image, the link text comes back blank. What I'd really like is to detect that the link contains an image and return the image's alt text instead.

So it looks like this...

from lxml.html import fromstring

doc = fromstring('<a href="Link One">Anchor Link One</a><br /><a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br /><a href="Link Three">Anchor Link Three</a><br />')
# Print the visible text and the href of every link.
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

Result:

Anchor Link One: Link One
: Link Two
Anchor Link Three: Link Three

I tried using .html_content() to get the raw HTML so I could check whether it was an image.

How can I detect that a link wraps an image, and/or pull out the HTML inside it?

Just modify your CSS selector so that it matches images inside links:

for img in doc.cssselect('a img'):
    print(img.get('alt'))

You can also use an XPath expression:

for img in doc.xpath('a//img'):
    print(img.get('alt'))
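Selecting the images directly gives you the alt text but not the link's href. The following is a minimal sketch of one way to get both, assuming the sample markup from the question; the variable names `html` and `found` are made up for illustration. It walks up from each image to its nearest enclosing link with `iterancestors`, and uses XPath only, so it should not need the separate cssselect package.

```python
from lxml.html import fromstring

# Sample markup from the question (illustrative input).
html = (
    '<a href="Link One">Anchor Link One</a><br />'
    '<a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br />'
    '<a href="Link Three">Anchor Link Three</a><br />'
)
doc = fromstring(html)

found = []
# './/a//img' matches every <img> nested at any depth inside an <a>.
for img in doc.xpath('.//a//img'):
    # iterancestors('a') yields the enclosing <a> elements, nearest first.
    link = next(img.iterancestors('a'))
    found.append((img.get('alt'), link.get('href')))

for alt, href in found:
    print('%s: %s' % (alt, href))
```

This only reports links that contain an image, so plain text links still need separate handling.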
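
Alternatively, loop over the links and fall back to the image's alt text when a link contains an image:

for link in doc.xpath('a'):
    # find('img') returns the first direct <img> child, or None.
    img = link.find('img')
    if img is not None:
        print('%s: %s' % (img.get('alt'), link.get('href')))
    else:
        print('%s: %s' % (link.text_content(), link.get('href')))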
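
As a rough, self-contained variant of the loop above, the sketch below wraps the same idea in a helper. The function name `link_labels` and the third sample link with a nested `<span>` are assumptions added for illustration and are not part of the original question. It differs in two small ways: it uses `iter('a')` so links are found at any depth (including when the fragment is a single `<a>`), and it searches with `.//img` so an image wrapped in another element is still picked up.

```python
from lxml.html import fromstring


def link_labels(html):
    """Return (label, href) pairs for every <a> in an HTML fragment.

    The label is the link's visible text; if the link has no text, the
    alt text of the first <img> found anywhere inside it is used instead.
    """
    doc = fromstring(html)
    pairs = []
    # iter('a') walks the whole tree, including the root element itself.
    for link in doc.iter('a'):
        label = link.text_content().strip()
        if not label:
            # './/img' also matches images nested deeper, e.g. in a <span>.
            img = link.find('.//img')
            if img is not None:
                label = img.get('alt', '')
        pairs.append((label, link.get('href')))
    return pairs


html = (
    '<a href="Link One">Anchor Link One</a><br />'
    '<a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br />'
    '<a href="Link Three"><span><img src="x.png" alt="Nested" /></span></a>'
)
for label, href in link_labels(html):
    print('%s: %s' % (label, href))
```

Checking for empty text first means a link that has both text and an image keeps its text as the label; swap the order if the alt text should win.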

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint one, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 