How do I extract an element, sub-elements and the full path from XML in Python?

I would like to extract an element from XML, including its sub-elements and the full path leading to it.

If this is my XML document:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
        <country>
            <name>b</name>
            <description>b short description</description>
            <population>
                <now>350000</now>
                <2000>150000</2000>
            </population>
        </country>
    </countries>
</world>

I would like to end up with this (see below), based on the XPath expression //country[name="a"]:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
    </countries>
</world>

This type of thing can be handled using XPath with lxml.

One thing, though: one of the tags ( <2000> ) is invalid, because an XML element name cannot begin with a digit. If you have no control over the source, you have to replace the offending tag with a placeholder before parsing and then restore it after processing.
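The replace-and-restore step can be sketched on its own as follows. This is only an illustration: the prefix `n__` and the two helper names are assumptions, not part of lxml or of the original answer, and the approach presumes the prefix does not already occur in the document.

```python
import re

# Hypothetical placeholder prefix; pick any string that does not occur in the data.
PREFIX = "n__"

def escape_numeric_tags(text):
    """Rename tags that start with a digit, e.g. <2000> -> <n__2000>."""
    return re.sub(r"<(/?)(\d)", rf"<\g<1>{PREFIX}\g<2>", text)

def restore_numeric_tags(text):
    """Undo escape_numeric_tags, e.g. </n__2000> -> </2000>."""
    return re.sub(rf"<(/?){re.escape(PREFIX)}(\d)", r"<\g<1>\g<2>", text)

snippet = "<population><now>250000</now><2000>100000</2000></population>"
escaped = escape_numeric_tags(snippet)
restored = restore_numeric_tags(escaped)
```

Because the pattern is anchored on the opening angle bracket, only tag names are rewritten; element text such as 250000 is left alone.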

So, all together:

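import lxml.html as lh

countries = """[your XML above]"""

# <2000> is not a valid tag name, so swap in a placeholder before parsing.
# Only the tags themselves are replaced, so text containing "2000" is untouched.
doc = lh.fromstring(countries.replace('<2000>', '<xxx>').replace('</2000>', '</xxx>'))

# Remove every <country> whose <name> is not "a"; its ancestors stay in place.
for country in doc.xpath('//country'):
    if country.xpath('./name/text()')[0] != 'a':
        country.getparent().remove(country)

# Serialize and restore the original tag name.
print(lh.tostring(doc).decode().replace('<xxx>', '<2000>').replace('</xxx>', '</2000>'))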

Output:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
        </countries>
</world>
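
For comparison, the same pruning idea can be sketched with only the standard library's xml.etree.ElementTree instead of lxml. This is a different technique from the answer above, shown under assumptions: the sample data is abbreviated, the invalid <2000> tag is already written as <y2000> (a made-up placeholder name), and keep_country is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample data; <y2000> stands in for the invalid <2000> tag (assumption).
XML = """<world>
    <countries>
        <country><name>a</name><population><now>250000</now><y2000>100000</y2000></population></country>
        <country><name>b</name><population><now>350000</now><y2000>150000</y2000></population></country>
    </countries>
</world>"""

def keep_country(xml_text, wanted_name):
    """Return the tree with every <country> removed except the one named wanted_name."""
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so visit each potential
    # parent and inspect its direct children. list() takes a snapshot so the
    # tree can be modified safely while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "country" and child.findtext("name") != wanted_name:
                parent.remove(child)
    return root

pruned = keep_country(XML, "a")
print(ET.tostring(pruned, encoding="unicode"))
```

The ancestors (<world> and <countries>) are never touched, so the surviving element keeps its full path.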
