
Get text from HTML using lxml

I'm trying to get the list of celebrity names from this site using XPath with lxml, but I'm having trouble.

Here is the HTML:

<div class="lists">
            <dl> <dt>A</dt> <dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd>

I want to get the text Adam Levine.

My code in Python is:

celebs = tree.xpath('//dd[a]/following-sibling::node()')

But my result is <Element dd at 0x1084ad4c8>...

If anyone could help, that would be great. Thanks.

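Extract the text with text() rather than following-sibling::node(). The following-sibling axis selects the nodes that come after each matching dd, which is why you get dd elements back instead of the link text. Like this: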

from lxml import etree

# your HTML is invalid as posted, so I have added the closing </dl> and </div> tags
s = '''<div class="lists">
            <dl> <dt>A</dt> <dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd></dl></div>'''

tree = etree.fromstring(s)

tree.xpath('.//dd/a/text()')
['Adam Levine']
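
If the page really does leave tags unclosed, etree.fromstring will reject it, because it expects well-formed XML. A minimal sketch of one alternative, assuming the lenient lxml.html parser is acceptable for your case: the second dd entry ("Example Person") is made up for illustration and is not from the original page.

```python
from lxml import html

# Fragment modelled on the question. The second <dd> is a hypothetical
# entry added for illustration; </dl> and </div> are left out on purpose.
snippet = '''<div class="lists">
<dl> <dt>A</dt>
<dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd>
<dd><a href="/people/example_person/" id="0">Example Person</a> </dd>'''

# lxml.html tolerates the missing closing tags and repairs the tree.
tree = html.fromstring(snippet)

# The axis from the question: nodes that FOLLOW each <dd>, so the result
# contains <dd> elements (and whitespace text), not the names.
siblings = tree.xpath('.//dd[a]/following-sibling::node()')

# text() on the <a> child returns the link text of every entry.
names = tree.xpath('.//dd/a/text()')
print(names)  # ['Adam Levine', 'Example Person']
```

The same './/dd/a/text()' expression works with either parser; only the parsing step changes.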


 