How can I get the text from this HTML snippet using lxml?
Can anyone explain why this snippet fails on the assert?
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]") # Finds 1 element, as expected
for el in elements:
    assert el.text is not None
And then... how can I get access to "XYZZY" and change it to "ZYX"?
Can anyone explain why this snippet fails on the assert?
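Because the text inside the <h2> element is not stored in the .text of the h2 itself; lxml stores it on one of the h2 element's children. The XPath expression still matches the h2, since text() selects its text node, but el.text only holds text that comes before the first child, so here it is None. You can use itertext() to get what you're looking for.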
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
elements = root.xpath(".//*[contains(text(),'XYZZY')]")
for el in elements:
    el_text = ''.join(el.itertext())
    assert el_text is not None
    print(el_text)
UPDATE: After looking at this some more, it turns out each Element has 3 relevant properties: .tag, .text and .tail.
For the .tail property, there is a small part in the tutorial that explains it:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.
Another explanation of how .tail gets populated:
LXML appends trailing text, which is not wrapped inside its own tag, as the .tail attribute of the tag just prior.
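To make the split between .text and .tail concrete, here is a small sketch that parses the tutorial's example and reads both properties. The variable names (body_text, br_tail and so on) are only illustrative and do not come from the original answer.

```python
from lxml import etree

# The mixed-content example from the tutorial quote above.
root = etree.fromstring('<html><body>Hello<br/>World</body></html>')
body = root.find('body')
br = body.find('br')

body_text = body.text   # text before the first child of <body>: 'Hello'
br_text = br.text       # <br/> has no content of its own: None
br_tail = br.tail       # text directly after <br/>: 'World'

# itertext() walks .text and .tail in document order.
full_text = ''.join(body.itertext())

print(body_text, br_text, br_tail, full_text)
```

So "World" belongs to the <br/> element's .tail rather than to the .text of <body>, which is the same situation as "XYZZY" following <img /> in the question.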
So we can actually write the following code to walk through each Element in the Element tree and find where the text XYZZY is located:
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    print("%s: %s : [text=%s : tail=%s]" % (action, elem.tag, elem.text, elem.tail))
Output:
start: div : [text=None : tail=None]
start: h2 : [text=None : tail=None]
start: img : [text=None : tail=XYZZY]
end: img : [text=None : tail=XYZZY]
end: h2 : [text=None : tail=None]
end: div : [text=None : tail=None]
So it is located in the .tail property of the <img> Element.
About your 2nd question:
And then... how can I get access to "XYZZY" and change it to "ZYX"?
One solution is to just walk the Element tree, check whether each element has the string in its text or tail, and then replace it:
#!/usr/bin/python3
from lxml import etree
s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)
search_string = "XYZZY"
replace_string = "ZYX"
context = etree.iterwalk(root, events=("start","end"))
for action, elem in context:
    if elem.text and elem.text.strip() == search_string:
        elem.text = replace_string
    elif elem.tail and elem.tail.strip() == search_string:
        elem.tail = replace_string
print(etree.tostring(root).decode("utf-8"))
Output:
<div><h2><img/>ZYX</h2></div>
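As an alternative to walking the whole tree, the matching text nodes can be selected directly with XPath. This is only a sketch and not part of the original answer: it relies on lxml returning "smart strings" for text() results, which carry getparent(), is_text and is_tail, and it uses str.replace(), so unlike the code above it also replaces the search string when it appears inside longer text.

```python
from lxml import etree

s = '<div><h2><img />XYZZY</h2></div>'
root = etree.fromstring(s)

search_string = "XYZZY"
replace_string = "ZYX"

# Select every text node containing the search string. lxml returns
# "smart strings" that remember which element they belong to.
for node in root.xpath(".//text()[contains(., $needle)]", needle=search_string):
    owner = node.getparent()
    new_value = node.replace(search_string, replace_string)
    if node.is_tail:
        # Tail text: getparent() is the element the text follows.
        owner.tail = new_value
    elif node.is_text:
        # Regular text: getparent() is the element containing the text.
        owner.text = new_value

result = etree.tostring(root).decode("utf-8")
print(result)
```

Passing the search string as the XPath variable $needle avoids building the expression by string concatenation, which would break if the string contained quotes.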