ElementTree text mixed with tags
Imagine the following text:
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
How would I parse this with the etree interface? Given the description tag, the .text property returns only the first word: the. The .getchildren() method returns the <b> elements, but not the rest of the text.
Many thanks!
Use .text_content(). Working sample using lxml.html:
from lxml.html import fromstring
data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""
tree = fromstring(data)
print(tree.xpath("//description")[0].text_content().strip())
Prints:
the thing stuff is very important for various reasons, notably other things.
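For comparison, the standard library's xml.etree.ElementTree has no text_content() method, but its itertext() method can produce the same flattened string. This is only a minimal sketch, assuming the input is well-formed XML; the variable names root and flat_text are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# ET.fromstring returns the root element, which here is <description>.
root = ET.fromstring(data.strip())

# itertext() yields the element's own text, then each descendant's
# text and tail, in document order; joining them flattens the markup.
flat_text = "".join(root.itertext()).strip()
print(flat_text)
```

As with text_content(), the tag boundaries are lost, so this only helps when the plain string is all that is needed.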
I forgot to specify one thing though, sorry. My ideal parsed version would contain a list of subsections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?
Assuming you'll have only text nodes and b elements inside a description:
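for item in tree.xpath("//description/*|//description/text()"):
    # Text nodes come back as str subclasses; anything else is a <b> element.
    # (On Python 2, test against basestring instead of str.)
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])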
Prints:
['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
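The same split can probably be done without XPath by relying on how ElementTree stores mixed content: an element's .text holds only the text before its first child, and each child's .tail holds the text that follows that child. The sketch below uses the standard library and a hypothetical helper named split_segments; it assumes, as above, that every child of description is a b element containing plain text.

```python
import xml.etree.ElementTree as ET


def split_segments(element):
    """Return [text, style] pairs for mixed content (hypothetical helper)."""
    segments = []
    # Text before the first child element lives in element.text.
    if element.text and element.text.strip():
        segments.append([element.text.strip(), "normal"])
    for child in element:
        # Assumption: every child is a <b> tag holding plain text.
        segments.append([(child.text or "").strip(), "bold"])
        # Text between this child and the next sibling lives in child.tail.
        if child.tail and child.tail.strip():
            segments.append([child.tail.strip(), "normal"])
    return segments


root = ET.fromstring(
    "<description>the thing <b>stuff</b> is very important for various "
    "reasons, notably <b>other things</b>.</description>"
)
segments = split_segments(root)
for segment in segments:
    print(segment)
```

This also shows why .text alone is not enough for the original question: everything after the first b element is attached to the children as tails, not to the description element itself.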