Parse an XML file into a dict using xpath

Question

Suppose I have an XML response like this:

from lxml import etree
XML_string= '''<div type="description" xml:base="elpais.es" xml:lang="es" xml:id="f0910b98">
<p xml:id="_657a490035" n="0001">blabla1</p>
<p xml:id="_657a490036" n="0002">blabla2. bla bla 2.</p>
<p xml:id="_657a490037" n="0003">blabla3.blabla3</p>
<p xml:id="_657a490038" n="0004">bla4</p></div>'''

I parse it as follows:

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(XML_string.encode() , parser=parser)

I want to transform the XML into a list of dicts like this:

result_list = [{'id': "_657a490035", 'n': '0001', 'text': 'blabla1'},
               {'id': "_657a490036", 'n': '0002', 'text': 'blabla2. bla bla 2.'},
               # etc.
              ]

I am very close with this:

all_paras = XML_tree.xpath('.//p[@xml:id]')
result_list = []
for para in all_paras:
    result_list.append({'text':para.text,'id':'id?','n':'n??'})

I don't know how to access the attribute values of the node para.

Any help?

EDIT: Note that if you do:

for para in all_paras:
     print(para.attrib)

I get this strange dict:

{'{http://www.w3.org/XML/1998/namespace}id': '_657a490035', 'n': '0001'}

For some reason xml:id turns into '{http://www.w3.org/XML/1998/namespace}id'.

Answer 1

Unfortunately, you are getting tangled up in namespaces. One way to handle the problem is to use local-name():

for para in all_paras:
    # I simplified the id attribute values a bit (to 1, 2, 3, 4) for readability
    result_list.append({
        'id': para.xpath('./@*[local-name()="id"]')[0],
        'n': para.xpath('./@*[local-name()="n"]')[0],
        'text': para.text,
    })
result_list

Output:

[{'id': '1', 'n': '0001', 'text': 'blabla1'},
 {'id': '2', 'n': '0002', 'text': 'blabla2. bla bla 2.'},
 {'id': '3', 'n': '0003', 'text': 'blabla3.blabla3'},
 {'id': '4', 'n': '0004', 'text': 'bla4'}]
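
As a self-contained sketch of the same local-name() idea, run against the original (unsimplified) id values: the helper first_attr and the shortened two-paragraph sample are my own additions for illustration, not part of the answer above. Keep in mind that local-name() ignores the namespace entirely, so it would also match a plain id attribute sitting next to xml:id.

```python
from lxml import etree

# Sample document from the question, shortened to two paragraphs.
xml_string = '''<div type="description" xml:lang="es" xml:id="f0910b98">
<p xml:id="_657a490035" n="0001">blabla1</p>
<p xml:id="_657a490036" n="0002">blabla2. bla bla 2.</p></div>'''

tree = etree.fromstring(xml_string.encode())


def first_attr(element, local_name):
    """Hypothetical helper: first attribute value with this local name, else None."""
    # $name is an XPath variable, filled in from the keyword argument.
    values = element.xpath('./@*[local-name()=$name]', name=local_name)
    return values[0] if values else None


rows = [
    {'id': first_attr(p, 'id'), 'n': first_attr(p, 'n'), 'text': p.text}
    for p in tree.xpath('.//p')
]
print(rows)
```

Returning None for a missing attribute avoids the IndexError that indexing with [0] raises on an empty result.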

Answer 2

The xml: in xml:lang, xml:id and xml:base is a special namespace prefix, bound to the http://www.w3.org/XML/1998/namespace namespace URI. Unlike any other prefix, it does not need to be declared in the XML document.
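
This is also why para.attrib shows the key as {http://www.w3.org/XML/1998/namespace}id: lxml stores namespaced names as "{URI}localname". A small sketch, assuming you just want to take such keys apart, using etree.QName (the names XML_NS and parts are mine):

```python
from lxml import etree

XML_NS = 'http://www.w3.org/XML/1998/namespace'

p = etree.fromstring('<p xml:id="_657a490035" n="0001">blabla1</p>')

# Split every attribute key into its namespace URI and local name.
parts = {}
for key, value in p.attrib.items():
    qname = etree.QName(key)
    parts[qname.localname] = (qname.namespace, value)

print(parts)
```

Attributes without a prefix, such as n, are in no namespace, so their namespace comes back as None.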

You can get the values of the xml:id attributes via xpath(), like this:

for para in all_paras:
    result_list.append({'text': para.text, 'id': para.xpath('@xml:id')[0]})

You could also use the get() method, but then you have to provide the full namespace URI enclosed in braces:

for para in all_paras:
    result_list.append({'text': para.text, 'id': para.get("{http://www.w3.org/XML/1998/namespace}id")})
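
Putting the pieces together, here is a sketch of the complete list the question asks for, including the n attribute. It reuses the question's sample data and parser settings; the constant XML_ID and the lowercase variable names are my own.

```python
from lxml import etree

# Qualified name of xml:id in the "{URI}localname" form that get() expects.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

xml_string = '''<div type="description" xml:base="elpais.es" xml:lang="es" xml:id="f0910b98">
<p xml:id="_657a490035" n="0001">blabla1</p>
<p xml:id="_657a490036" n="0002">blabla2. bla bla 2.</p>
<p xml:id="_657a490037" n="0003">blabla3.blabla3</p>
<p xml:id="_657a490038" n="0004">bla4</p></div>'''

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False,
                         recover=True, ns_clean=True)
tree = etree.fromstring(xml_string.encode(), parser=parser)

# n is not namespaced, so a plain get('n') is enough for it.
result_list = [
    {'id': para.get(XML_ID), 'n': para.get('n'), 'text': para.text}
    for para in tree.xpath('.//p[@xml:id]')
]
print(result_list)
```

Unlike indexing an xpath() result with [0], get() returns None when the attribute is absent, which may be easier to handle if some paragraphs lack an n.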

Answer 1
0, accepted, 2021-05-19 00:09:11

Answer 2
0, 2021-05-19 08:12:36
