Python XPath parsing of XML, avoiding <lb/>
I am parsing an XML file with XPath:
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
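I would like to serialize the XML above as follows:
{"_3a327f0003": "1. A car is",
 "_3a327f0004": "- big, yellow and red;",
 "_3a327f0005": "- has a big motor;",
 "_3a327f0006": "- and also has big seats"}
Basically, I want to extract the text and build a dictionary in which each piece of text is keyed by its xml:id. My code is as follows: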
def PDM_XML_serializer(xml_string):
    # Lenient parser settings: keep entities and CDATA, recover from malformed input
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
    XML_tree = etree.fromstring(xml_string.encode(), parser=parser)
    all_paras = XML_tree.xpath('.//p[@xml:id]')
    list_of_paragraphs = []
    for para in all_paras:
        mydict = {}
        # para.text only holds the text before the first child element
        mydict['text'] = para.text
        for att in para.attrib:
            mykey = att
            if 'id' in mykey:
                mykey = 'xmlid'
            mydict[mykey] = para.attrib[att]
        list_of_paragraphs.append(mydict)
    return list_of_paragraphs

PDM_XML_serializer(example)
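It works, except that when I have a node like this:
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
it does not extract the text that comes after <lb/>.
How should I modify:
XML_tree.xpath('.//p[@xml:id]')
so that I get all the text from <p to /p>?
Edit: para.itertext() works, but for the first node it also returns all the text of the other (nested) nodes.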
Using xml.etree.ElementTree
import xml.etree.ElementTree as ET
xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
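def _get_element_txt(element):
    # Text before the first child element (None if there is none)
    txt = element.text or ''
    children = list(element)
    if children:
        # Text after the first child (e.g. <lb/>) is stored as that child's tail
        txt += (children[0].tail or '').strip()
    return txt.strip()

root = ET.fromstring(xml)
# xml:id is exposed under its expanded namespace name
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}
for k, v in data.items():
    print(f'{k} --> {v}')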
Output:
_3a327f0004 --> - big, yellow and red;
_3a327f0005 --> - has a big motor;
_3a327f0006 --> - and also has big seats.
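The helper above depends on how ElementTree stores mixed content: an element's .text ends at its first child, and whatever follows a child is kept on that child as its .tail. The following is a minimal sketch of that model, not part of the original answer; the name direct_text is hypothetical, and unlike _get_element_txt it collects the tails of all direct children rather than only the first.

```python
import xml.etree.ElementTree as ET

# Same shape as the problematic paragraph in the question.
snippet = '<p> - and also has <lb/>\nbig seats.\n</p>'
p = ET.fromstring(snippet)
lb = p[0]  # the empty <lb/> element

# .text stops at the first child element; the rest lives on <lb/> as its tail.
print(repr(p.text))
print(repr(lb.tail))


def direct_text(element):
    """Join an element's own text with the tails of all its direct children.

    Hypothetical helper for illustration: text inside nested elements is ignored.
    """
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    # split() + join collapses runs of whitespace, including newlines.
    return ' '.join(''.join(parts).split())


print(direct_text(p))
```

Because only the element's own text and its children's tails are read, a parent paragraph would not pick up the text of nested paragraphs, which is the problem the question reports with itertext().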
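Using lxml.etree
Parse all the elements in all_paras with a list/dict comprehension. Because your XML uses the special xml prefix, and lxml does not yet support parsing namespace prefixes in attributes (see @mzjn's answer), the code below uses a workaround with next + iter to retrieve the attribute value. This takes the first attribute value of each element, which works here because xml:id is the only attribute on each p.
In addition, to retrieve all the text values between the nodes, xpath("text()") is used together with str.strip and .join to clean up whitespace and line breaks and join the pieces together.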
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
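XML_tree = etree.fromstring(example)
all_paras = XML_tree.xpath('.//p[@xml:id]')
output = {
    # key: first (and only) attribute value, i.e. the xml:id
    # value: the element's own text nodes, stripped and joined
    next(iter(t.attrib.values())): " ".join(
        i.strip() for i in t.xpath("text()")
    ).strip()
    for t in all_paras
}
output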
# {
# '_3a327f0003': '1. A car is',
# '_3a327f0004': '- big, yellow and red;',
# '_3a327f0005': '- has a big motor;',
# '_3a327f0006': '- and also has big seats.'
# }
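As an alternative to the next + iter workaround, the attribute can also be looked up by its expanded name, the same form the ElementTree answer uses. This is a sketch rather than part of the original answer: the constant XML_ID is a name introduced here, and the sample is shortened to three paragraphs.

```python
from lxml import etree

# Expanded name of xml:id; the namespace URI is fixed by the XML specification.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''

root = etree.fromstring(example)
output = {
    # text() selects only the element's own text nodes, not those of nested <p>
    p.get(XML_ID): ' '.join(''.join(p.xpath('text()')).split())
    for p in root.iter('p')
    if p.get(XML_ID) is not None
}
print(output)
```

Looking the attribute up by name keeps working if a paragraph later gains other attributes, whereas taking the first attribute value depends on attribute order.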
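Based on your example, this modifies the XPath to exclude the "A car is" text. It also uses the XPath functions string and normalize-space to evaluate the para node as a string, which concatenates its text nodes, and to clean up the text so that it matches your example.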
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
def PDM_XML_serializer(xml_string):
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
    XML_tree = etree.fromstring(xml_string.encode(), parser=parser)
    # Only the nested paragraphs, so the outer "A car is" paragraph is skipped
    all_paras = XML_tree.xpath('./p/p[@xml:id]')
    list_of_paragraphs = []
    for para in all_paras:
        mydict = {}
        # string(.) concatenates all descendant text; normalize-space trims and collapses whitespace
        mydict['text'] = para.xpath('normalize-space(string(.))')
        for att in para.attrib:
            mykey = att
            if 'id' in mykey:
                mykey = 'xmlid'
            mydict[mykey] = para.attrib[att]
        list_of_paragraphs.append(mydict)
    return list_of_paragraphs

PDM_XML_serializer(example)
If these tags are just noise to you, you can simply remove them before parsing the XML:
XML_tree = etree.fromstring(example.replace('<lb/>', '').encode() , parser=parser)
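The string replacement above only matches the exact spelling <lb/>. A variation that works on the parsed tree instead is lxml's etree.strip_tags, which removes the named elements while keeping the text around them. The sketch below is an alternative under that assumption, not what the original answer proposes, and the sample is shortened to two paragraphs.

```python
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''

root = etree.fromstring(example)

# Remove every <lb> element; the text that followed it is merged into its parent.
etree.strip_tags(root, 'lb')

last = root.xpath('.//p[@xml:id="_3a327f0006"]')[0]
# With <lb/> gone, .text now covers the whole paragraph; collapse the whitespace.
cleaned = ' '.join(last.text.split())
print(cleaned)
```

After this step the question's original para.text approach returns the full paragraph text, since there is no child element left to cut it short.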