How to get all XPaths from XML with just key names and no template URLs, with Python
I need to extract XPaths and values from an XML object. Currently I use lxml, which gives either long paths with repeated template URLs or just indices of the XPath keys without names.
Question: how can I get XPaths with just names, without template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.
With getelementpath(): has template URLs and '\n\t\t' in empty keys.
>>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]
[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
'ISO_639-1'),
('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
'xx'),
('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
'\n\t\t\t'),
('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
'ISO_3166-1')]
With getpath(): has no key names or URLs, and '\n\t\t' in empty keys.
>>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]
[('/*/*[2]/*[1]/*', 'ISO_639-1'),
('/*/*[2]/*[2]', 'xx'),
('/*/*[3]', '\n\t\t'),
('/*/*[3]/*[1]', '\n\t\t\t'),
('/*/*[3]/*[1]/*', 'ISO_3166-1')]
What I need: key names without URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...
[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]
This is the XML header:
<?xml version="1.0" ?>
<Lab_test_results
xmlns="http://schemas.oceanehr.com/templates"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:rm="http://schemas.openehr.org/v1"
template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
<name>
<value>Lab test results</value>
</name>
<language>
<terminology_id>
<value>ISO_639-1</value>
</terminology_id>
<code_string>ru</code_string>
I'd still use .getpath().
The reason you're getting * in your paths is that your XML has a default namespace. By using *, the namespace doesn't need to be taken into account when creating a usable XPath.
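To see the effect in isolation, here is a minimal sketch comparing getpath() on two small documents that differ only in the default namespace. The documents, the http://example.com/ns URI, and the helper name leaf_path are made up for illustration.

```python
from lxml import etree

# Hypothetical documents: identical except for the default namespace.
with_ns = etree.fromstring(
    '<root xmlns="http://example.com/ns"><a><b>x</b></a></root>'
)
without_ns = etree.fromstring('<root><a><b>x</b></a></root>')


def leaf_path(root):
    """Return getpath() for the last element in document order."""
    tree = root.getroottree()
    last = list(root.iter())[-1]
    return tree.getpath(last)


ns_path = leaf_path(with_ns)        # wildcard steps, no element names
plain_path = leaf_path(without_ns)  # named steps

print(ns_path)
print(plain_path)
```

The namespaced document yields wildcard steps because an XPath 1.0 expression cannot address a default namespace without a prefix, while the plain document yields element names.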
To resolve this, first set the element name (.tag) to the local name (the element name without prefix or URI).
Also, you can create an XMLParser and set remove_blank_text to True to get rid of the entries that are only whitespace.
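The difference the parser option makes can be checked with a short sketch like the following. The indented sample document is an assumption chosen to mimic the tab-indented XML in the question.

```python
from lxml import etree

# Hypothetical tab-indented document, similar in shape to the question's XML.
XML = b"<root>\n\t<a>\n\t\t<b>x</b>\n\t</a>\n</root>"

# Default parser: indentation between tags is kept as element text.
default_root = etree.fromstring(XML)

# remove_blank_text=True: whitespace-only text between tags is dropped.
stripped_parser = etree.XMLParser(remove_blank_text=True)
stripped_root = etree.fromstring(XML, parser=stripped_parser)

default_texts = [e.text for e in default_root.iter()]
stripped_texts = [e.text for e in stripped_root.iter()]

print(default_texts)
print(stripped_texts)
```

With the default parser the container elements report their indentation as text, whereas with remove_blank_text=True they report None, which matches the output the question asks for.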
Example...
XML Input (test.xml)
<Lab_test_results
xmlns="http://schemas.oceanehr.com/templates"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:rm="http://schemas.openehr.org/v1"
template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
<name>
<value>Lab test results</value>
</name>
<language>
<terminology_id>
<value>ISO_639-1</value>
</terminology_id>
</language>
</Lab_test_results>
Python
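from lxml import etree
from pprint import pprint

# Drop whitespace-only text nodes so container elements report None
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('test.xml', parser=parser)

xpaths = []
for elem in tree.iter():
    # Replace the namespaced tag with its local name so getpath() emits names
    elem.tag = etree.QName(elem).localname
    xpaths.append((tree.getpath(elem), elem.text))

pprint(xpaths)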
Printed Output
[('/Lab_test_results', None),
('/Lab_test_results/name', None),
('/Lab_test_results/name/value', 'Lab test results'),
('/Lab_test_results/language', None),
('/Lab_test_results/language/terminology_id', None),
('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
If you also need to collect attributes, you can make a few small changes...
for elem in tree.iter():
    elem.tag = etree.QName(elem).localname
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    for attr in elem.attrib:
        xpaths.append((f"{xpath}/@{attr}", elem.get(attr)))
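If modifying the tags in place is undesirable, or if the root-relative format from the question ('language/terminology_id/value') is preferred, one possible variation is to build the paths by walking each element's ancestors. This is only a sketch under stated assumptions: the helper names local and name_paths, the sample document, and the xsi:type attribute are invented for illustration. It also strips the namespace from attribute names, since keys in elem.attrib use the same '{uri}name' notation as tags.

```python
from lxml import etree

# Hypothetical sample modeled on the question's XML; xsi:type is made up.
XML = b"""<Lab_test_results xmlns="http://schemas.oceanehr.com/templates"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <language xsi:type="CODE_PHRASE">
    <terminology_id>
      <value>ISO_639-1</value>
    </terminology_id>
    <code_string>xx</code_string>
  </language>
</Lab_test_results>"""


def local(name):
    """Strip the '{uri}' part from a tag or attribute name."""
    return etree.QName(name).localname


def name_paths(root):
    """Yield (path, value) pairs relative to the document root element.

    Paths use local names only and carry no positional indices, so
    repeated siblings with the same name produce identical paths.
    """
    for elem in root.iter(tag=etree.Element):  # skip comments and PIs
        if elem.getparent() is None:
            continue  # the root itself is left out of the relative paths
        parts = [local(elem.tag)]
        for ancestor in elem.iterancestors():
            if ancestor.getparent() is None:
                break  # stop before the root element
            parts.append(local(ancestor.tag))
        path = "/".join(reversed(parts))
        yield path, elem.text
        for attr, value in elem.attrib.items():
            yield f"{path}/@{local(attr)}", value


parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(XML, parser=parser)
result = list(name_paths(root))
for item in result:
    print(item)
```

The tree is left untouched here, at the cost of losing the [n] indices that getpath() adds to tell repeated siblings apart; if those matter, the .getpath() approach above is the safer choice.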