How do I get the text from this HTML fragment using lxml?

Question

Can anyone explain why this snippet fails on the assert?

from lxml import etree

s = '<div><h2><img />XYZZY</h2></div>'

root = etree.fromstring(s)

elements = root.xpath(".//*[contains(text(),'XYZZY')]")  # Finds 1 element, as expected

for el in elements:
    assert el.text is not None

And then... how can I get access to "XYZZY" and change it to "ZYX"?

Answer 1

Can anyone explain why this snippet fails on the assert?

Because lxml does not store the text XYZZY on the <h2> element itself; it stores it on one of the children of the h2 element, so el.text is None for <h2>. The XPath expression still matches <h2> because, in XPath's data model, XYZZY is a text-node child of <h2>. You can use itertext() to get what you're looking for.

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]")
for el in elements:
    el_text = ''.join(el.itertext())
    assert el_text is not None
    print(el_text)

UPDATE: After looking at this some more, it turns out each Element has 3 relevant properties: .tag, .text and .tail.

For the .tail property, there is a small part in the tutorial that explains it:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.
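As a small illustration of the tutorial passage, the sketch below parses that same document and prints where lxml keeps each piece of text. The variable names body and br are only for this example.

```python
from lxml import etree

root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

# Text before the first child belongs to the parent's .text
print(repr(body.text))
# <br/> has no content of its own, so its .text is None
print(repr(br.text))
# Text that follows <br/> is stored on <br/> itself, in .tail
print(repr(br.tail))
```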

How .tail is populated is explained again here:

lxml appends trailing text, which is not wrapped inside its own tag, as the .tail attribute of the tag just prior.

So we can actually write the following code to walk through each Element in the Element tree and find where the text XYZZY is located:

from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    print("%s: %s : [text=%s : tail=%s]" % (action, elem.tag, elem.text, elem.tail))

Output:

start: div : [text=None : tail=None]
start: h2 : [text=None : tail=None]
start: img : [text=None : tail=XYZZY]
end: img : [text=None : tail=XYZZY]
end: h2 : [text=None : tail=None]
end: div : [text=None : tail=None]

So it is located in the .tail property of the <img> Element.
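For this particular fragment, the same thing can be checked directly, without walking the tree. This is a minimal sketch that assumes the structure is exactly the one from the question (one <h2> containing one <img>).

```python
from lxml import etree

root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')
h2 = root.find("h2")
img = h2.find("img")

# <h2> has no text before its first child, so .text is None
print(h2.text)
# The text after <img/> is stored in the .tail of <img>
print(img.tail)
# itertext() gathers .text and .tail of everything below <h2>
print("".join(h2.itertext()))
```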

About your 2nd question:

And then... how can I get access to "XYZZY" and change it to "ZYX"?

One solution is to just walk the Element tree, check whether each element has the string in its text or tail, and then replace it:

#!/usr/bin/python3
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

search_string = "XYZZY"
replace_string = "ZYX"

context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    if elem.text and elem.text.strip() == search_string:
        elem.text = replace_string
    elif elem.tail and elem.tail.strip() == search_string:
        elem.tail = replace_string

print(etree.tostring(root).decode("utf-8"))

Output:

<div><h2><img/>ZYX</h2></div>
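An alternative sketch, not part of the original answer: lxml's XPath text() results are "smart strings" that expose getparent(), is_text and is_tail, so the text nodes can be selected directly and written back to the right attribute. The helper name replace_text is hypothetical, and unlike the loop above it replaces substrings rather than requiring an exact match after strip().

```python
from lxml import etree


def replace_text(root, search, replacement):
    """Replace `search` in every text node below `root`; return the node count.

    Hypothetical helper for illustration only.
    """
    count = 0
    # The search string is passed as an XPath variable to avoid quoting issues.
    for node in root.xpath(".//text()[contains(., $needle)]", needle=search):
        owner = node.getparent()  # element whose .text or .tail holds the node
        if node.is_text:
            owner.text = owner.text.replace(search, replacement)
        elif node.is_tail:
            owner.tail = owner.tail.replace(search, replacement)
        count += 1
    return count


root = etree.fromstring('<div><h2><img />XYZZY</h2></div>')
changed = replace_text(root, "XYZZY", "ZYX")
result = etree.tostring(root).decode("utf-8")
print(changed, result)
```

Here the single matching text node is the tail of <img>, so only img.tail is rewritten.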

Solution 1
2 · Accepted · 2020-12-30 11:33:46
