Python XPath: parse XML, ignoring <lb />

Question

I am parsing an XML file with XPath:

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''

I would like to serialize the XML above as follows:

{"_3a327f0003": "1. A car is",
 "_3a327f0004": "- big, yellow and red;",
 "_3a327f0005": "- has a big motor;",
 "_3a327f0006": "- and also has big seats"}

Basically, I want to extract the text and build a dictionary in which each text is keyed by the xml:id of its element. My code is as follows:

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('.//p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.text
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

print(list_of_paragraphs)

It works, except that if I have a node like this:

<p xml:id="_3a327f0006"> - and also has <lb/>
                        big seats.
                      </p>

it does not extract the text that follows <lb/>.

How should I modify:

XML_tree.xpath('.//p[@xml:id]')

so that I get all the text from <p> to </p>?

Edit: para.itertext() could be used, but then the first node also returns the text of all the nodes nested inside it.

Answer 1

Using xml.etree.ElementTree:

import xml.etree.ElementTree as ET

xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''


def _get_element_txt(element):
    txt = element.text
    children = list(element)
    if children:
        txt += children[0].tail.strip()
    return txt


root = ET.fromstring(xml)
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}
for k, v in data.items():
    print(f'{k} --> {v}')

output

_3a327f0004 -->  - big, yellow and red;
_3a327f0005 -->  - has a big motor;
_3a327f0006 -->  - and also has big seats.
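The helper above appends only the tail of the first child and assumes that both text and tail are set. A slightly more defensive variant might look like the sketch below. The names own_text and XML_ID are hypothetical, and the whitespace normalization is an assumption about the desired output; unlike the answer, it also iterates over the outer p.

```python
import xml.etree.ElementTree as ET

# The predefined "xml" prefix expands to this namespace in attribute keys.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
      big seats.
    </p>
  </p>
</div>'''


def own_text(element):
    """Return the element's direct text: its .text plus the tail of every child."""
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    # Collapse runs of whitespace and newlines into single spaces.
    return ' '.join(''.join(parts).split())


root = ET.fromstring(xml)
data = {p.get(XML_ID): own_text(p) for p in root.iter('p')}
for k, v in data.items():
    print(f'{k} --> {v}')
```

Because only .text and the children's tails are read, the text inside nested p elements is left to those elements' own entries.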

Answer 2

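Use lxml.etree and parse all the elements in all_paras inside a list/dict comprehension. Because your XML uses the special xml prefix, and lxml does not resolve namespace prefixes in attribute keys (the key is stored in its expanded form rather than as xml:id; see @mzjn's answer), the code below uses a workaround with next + iter to retrieve the attribute value.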

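In addition, to retrieve all the text values between the nodes, xpath("text()") is combined with str.strip and .join to clean up whitespace and line breaks and concatenate the pieces.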

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''
                
XML_tree = etree.fromstring(example)
all_paras = XML_tree.xpath('.//p[@xml:id]')

output = {
    next(iter(t.attrib.values())):" ".join(i.strip() 
        for i in t.xpath("text()")).strip()
    for t in all_paras
}

output
# {
#  '_3a327f0003': '1. A car is', 
#  '_3a327f0004': '- big, yellow and red;',
#  '_3a327f0005': '- has a big motor;',
#  '_3a327f0006': '- and also has big seats.'
# }
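Note that next(iter(t.attrib.values())) returns whichever attribute comes first, so it relies on xml:id being the first attribute of every p. As a possible alternative, the attribute can be looked up by its expanded name. This is a minimal sketch; the constant name XML_ID and the one-line sample element are my own.

```python
from lxml import etree

# Expanded (Clark notation) name of xml:id; the "xml" prefix is predefined.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

# Sample element where xml:id is NOT the first attribute.
el = etree.fromstring('<p n="7" xml:id="_a1">text</p>')

first_value = next(iter(el.attrib.values()))  # picks up n, not xml:id
by_name = el.get(XML_ID)                      # explicit lookup by expanded name
via_xpath = el.xpath('string(@xml:id)')       # XPath resolves the xml prefix

print(first_value, by_name, via_xpath)
```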

Answer 3

You can use lxml's itertext() to get the text content of a p element:

mydict['text'] = ''.join(para.itertext())

See this question for a more general solution.
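Since itertext() also walks into nested p elements, the outer paragraph ends up with the text of its children as well. One way around this, sketched here under the assumption that only nested p elements should be skipped and that the tree contains no comments, is a small generator that mimics itertext(). The name iter_own_text and the sample markup (including the hi element) are hypothetical.

```python
from lxml import etree


def iter_own_text(el, skip='p'):
    """Yield text like itertext(), but do not descend into children tagged `skip`."""
    if el.text:
        yield el.text
    for child in el:
        if child.tag != skip:
            # Inline children (e.g. <hi>, <lb/>) contribute their own text.
            yield from iter_own_text(child, skip)
        if child.tail:
            # The tail belongs to the parent's text flow, so it is always kept.
            yield child.tail


outer = etree.fromstring(
    '<p xml:id="a">A <hi>bold</hi> car <p xml:id="b">nested</p> end</p>'
)

everything = ' '.join(''.join(outer.itertext()).split())
own_only = ' '.join(''.join(iter_own_text(outer)).split())

print(everything)
print(own_only)
```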

Answer 4

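Following your example, this modifies the XPath so that the "A car is" text is excluded. It also uses the XPath functions string and normalize-space to evaluate the para node as a string, which concatenates its text nodes, and to clean up the text so that it matches your example.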

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('./p/p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.xpath('normalize-space(string(.))')
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

print(list_of_paragraphs)
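To illustrate why the XPath is narrowed to ./p/p here: string(.) returns the text of all descendants, so evaluating it on the outer p would pull in every nested paragraph. The following is a small sketch of that behaviour; the variable names are my own.

```python
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
      big seats.
    </p>
  </p>
</div>'''

root = etree.fromstring(example)
outer = root.xpath('./p')[0]
leaf = root.xpath('./p/p[@xml:id="_3a327f0006"]')[0]

# string(.) concatenates every descendant text node of the context node.
outer_text = outer.xpath('normalize-space(string(.))')
leaf_text = leaf.xpath('normalize-space(string(.))')

print(outer_text)
print(leaf_text)
```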

Answer 5

If these tags are just noise to you, you can simply remove them before reading the XML:

XML_tree = etree.fromstring(example.replace('<lb/>', '').encode() , parser=parser)
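A plain string replacement only matches the exact spelling <lb/>, so a variant such as <lb /> would slip through. As a possible alternative, lxml's etree.strip_tags removes the elements after parsing while keeping their tail text. This is a minimal sketch on a single paragraph; the variable names are my own.

```python
from lxml import etree

# Same paragraph as in the question, but with the tag written as "<lb />".
snippet = '<p xml:id="_3a327f0006"> - and also has <lb />\n  big seats.\n</p>'

# The literal replacement finds nothing to remove in this spelling.
unchanged = snippet.replace('<lb/>', '')

para = etree.fromstring(snippet)
# Drop every <lb> element; its tail text is merged into the parent's text.
etree.strip_tags(para, 'lb')

text = ' '.join(para.text.split())
print(text)
```

After strip_tags, para.text holds the whole sentence, so the original para.text lookup from the question works unchanged.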

Answer 1
2 2021-05-25 15:19:25

Answer 2
1 2021-05-25 17:14:21

Answer 3
0 2021-05-25 10:02:19

Answer 4
0 2021-05-25 17:35:49

Answer 5
0 2021-06-06 09:49:30
