
Get text from HTML using lxml

I'm trying to get the list of celebrity names from this site using XPath with lxml, but I'm having trouble.

Here is the HTML:

<div class="lists">
            <dl> <dt>A</dt> <dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd>

I want to get the text "Adam Levine".

My Python code is:

celebs = tree.xpath('//dd[a]/following-sibling::node()')

But my result is <Element dd at 0x1084ad4c8>...

If anyone could help, that would be great. Thanks.

Extract the text with text() rather than following-sibling::node(), like this:

from lxml import etree

# your HTML is incomplete, so I have deliberately added the closing </dl> and </div> tags
s = '''<div class="lists">
            <dl> <dt>A</dt> <dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd></dl></div>'''

tree = etree.fromstring(s)

print(tree.xpath('.//dd/a/text()'))
# prints: ['Adam Levine']
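
If the real page cannot be tidied into well-formed XML first, the same XPath may also work through lxml's lenient HTML parser. The following is only a sketch: the fragment is modelled on the question's markup, the second entry ("Example Person") is made up for illustration, and the variable names page and celebs are assumptions.

```python
from lxml import html  # lxml's HTML parser, which tolerates unclosed tags

# Hypothetical fragment modelled on the question's markup.
# The closing </dl> and </div> are left out on purpose, and the
# second <dd> entry is invented for illustration.
page = '''<div class="lists">
<dl> <dt>A</dt>
<dd><a href="/people/adam_levine/" id="20608779">Adam Levine</a>    </dd>
<dd><a href="/people/example_person/" id="1">Example Person</a>    </dd>
'''

tree = html.fromstring(page)

# Select the text node of every <a> that is a direct child of a <dd>.
names = tree.xpath('.//dd/a/text()')

# Strip stray whitespace in case a link's text is padded.
celebs = [name.strip() for name in names]
print(celebs)
```

As for the original query: following-sibling::node() selects the nodes that come after each matching dd at the same level (other dd elements and whitespace), not the text inside the link, which is why it returned Element objects instead of names.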
