How do I extract an element, its sub-elements, and the full path from XML in Python?
I would like to extract an element, including its sub-elements and the full path leading to it, from an XML document.
If this is my XML document:
<world>
  <countries>
    <country>
      <name>a</name>
      <description>a short description</description>
      <population>
        <now>250000</now>
        <2000>100000</2000>
      </population>
    </country>
    <country>
      <name>b</name>
      <description>b short description</description>
      <population>
        <now>350000</now>
        <2000>150000</2000>
      </population>
    </country>
  </countries>
</world>
I would like to end up with this (see below), based on the XPath expression //country[name="a"]:
<world>
  <countries>
    <country>
      <name>a</name>
      <description>a short description</description>
      <population>
        <now>250000</now>
        <2000>100000</2000>
      </population>
    </country>
  </countries>
</world>
This kind of task can be handled with XPath in lxml.
One caveat, though: one of the tags (<2000>) is invalid, because an XML element name cannot begin with a digit. If you have no control over the source, you have to replace the offending tag name before parsing and then restore it after processing.
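The replace-and-restore step can be sketched with a regular expression that touches only tag names, so that a "2000" appearing in text content is left alone. This is a minimal sketch, not part of the original answer: the `n_` prefix and both helper names are hypothetical, and it assumes the document has no comments or CDATA sections containing text that looks like a numeric tag.

```python
import re

# Hypothetical prefix; any string that does not already start a real tag name works.
PREFIX = "n_"

def escape_numeric_tags(xml_text: str) -> str:
    """Prefix tag names that start with a digit so that a parser accepts them."""
    return re.sub(r"<(/?)(\d[\w.-]*)", rf"<\g<1>{PREFIX}\g<2>", xml_text)

def unescape_numeric_tags(xml_text: str) -> str:
    """Undo escape_numeric_tags after the document has been serialized."""
    pattern = rf"<(/?){re.escape(PREFIX)}(\d[\w.-]*)"
    return re.sub(pattern, r"<\g<1>\g<2>", xml_text)

snippet = "<population><now>250000</now><2000>100000</2000></population>"
escaped = escape_numeric_tags(snippet)
print(escaped)
print(unescape_numeric_tags(escaped) == snippet)
```

Matching on the leading `<` or `</` is what keeps the substitution from rewriting numbers in element text.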
So, putting it all together:
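import lxml.html as lh

countries = """[your xml above]"""

# <2000> is not a valid tag name, so swap in a placeholder before parsing.
doc = lh.fromstring(countries.replace('2000', 'xxx'))

# Remove every <country> whose <name> is not 'a'; its ancestors stay in place.
for country in doc.xpath('//country'):
    if country.xpath('./name/text()')[0] != 'a':
        country.getparent().remove(country)

# Serialize the pruned tree and restore the original tag name.
print(lh.tostring(doc).decode().replace('xxx', '2000'))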
Output:
<world>
  <countries>
    <country>
      <name>a</name>
      <description>a short description</description>
      <population>
        <now>250000</now>
        <2000>100000</2000>
      </population>
    </country>
  </countries>
</world>
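For a more general version of the same idea, one option is to keep each matching element together with its ancestors and descendants and remove everything else. The sketch below swaps lxml for the standard library's xml.etree.ElementTree, which has no getparent(), so it builds a parent map first. The function name `prune_to_matches`, the `n_` placeholder prefix, and the shortened sample document are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

def prune_to_matches(root, path):
    """Remove every element that is not a match, inside a match, or an ancestor of one."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    keep = set()
    for match in root.findall(path):
        keep.update(match.iter())      # the match and all of its descendants
        node = match
        while node in parent_of:       # walk up to the root
            node = parent_of[node]
            keep.add(node)
    for parent in list(root.iter()):
        for child in list(parent):
            if child not in keep:
                parent.remove(child)
    return root

xml_text = """<world><countries>
<country><name>a</name><population><now>250000</now><2000>100000</2000></population></country>
<country><name>b</name><population><now>350000</now><2000>150000</2000></population></country>
</countries></world>"""

# Tag names cannot start with a digit, so prefix them before parsing.
safe = re.sub(r"<(/?)(\d)", r"<\g<1>n_\g<2>", xml_text)
root = prune_to_matches(ET.fromstring(safe), ".//country[name='a']")

# Serialize and strip the placeholder prefix again.
result = re.sub(r"<(/?)n_(\d)", r"<\g<1>\g<2>", ET.tostring(root, encoding="unicode"))
print(result)
```

ElementTree supports only a subset of XPath, which is why the expression is written as `.//country[name='a']` here; with lxml the original `//country[name="a"]` can be used directly.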