How to get all XPaths from XML with only key names and no template URLs, in Python

Question

I need to extract XPaths and values from an XML object. Currently I use lxml, which gives either long paths with repeated template URLs or just the indices of the XPath keys without names.

Question: how can I get XPaths with just names, without template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.

With getelementpath(): the paths contain template URLs, and empty keys have '\n\t\t' as their value.

>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]

[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_639-1'),
 ('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
  'xx'),
 ('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
  '\n\t\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_3166-1')]

With getpath(): the paths have no key names, and empty keys have '\n\t\t' as their value.

>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]

[('/*/*[2]/*[1]/*', 'ISO_639-1'),
 ('/*/*[2]/*[2]', 'xx'),
 ('/*/*[3]', '\n\t\t'),
 ('/*/*[3]/*[1]', '\n\t\t\t'),
 ('/*/*[3]/*[1]/*', 'ISO_3166-1')]

What I need: key names without URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...

[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]

This is the XML header:

<?xml version="1.0" ?>
<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>

Answer 1

I'd still use .getpath().

The reason you're getting * in your paths is that your XML has a default namespace. By using *, lxml can produce a usable XPath without having to take the namespace into account.
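The effect of the default namespace on .getpath() can be seen in a minimal sketch. The namespace URI "urn:example" and the element names root and child are placeholders chosen for illustration, not taken from the question's document.

from lxml import etree

# The same structure twice: once with a default namespace, once without.
# "urn:example" is a placeholder namespace URI (an assumption for this sketch).
namespaced = etree.fromstring(b'<root xmlns="urn:example"><child/></root>')
plain = etree.fromstring(b'<root><child/></root>')

# getpath() is a method of the ElementTree that owns the element
ns_path = namespaced.getroottree().getpath(namespaced[0])
plain_path = plain.getroottree().getpath(plain[0])

print(ns_path)     # wildcard steps, because the names are namespace-qualified
print(plain_path)  # readable steps, because the names have no namespace

With the namespace present, every step is a wildcard; without it, the steps carry the element names.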

To resolve this, first set the element name (.tag) to the local name (the element name without prefix or URI).

Also, you can create an XMLParser and set remove_blank_text to True to get rid of the entries that are only whitespace.
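As a small sketch of what remove_blank_text changes, the snippet below parses the same bytes with and without that parser option. The tiny document with elements a and b is made up for illustration.

from lxml import etree

# Made-up document: <a> contains only indentation whitespace around <b>
xml = b"<a>\n\t<b>text</b>\n</a>"

default_root = etree.fromstring(xml)
stripped_root = etree.fromstring(xml, etree.XMLParser(remove_blank_text=True))

print(repr(default_root.text))      # the indentation whitespace is kept as text
print(repr(stripped_root.text))     # the whitespace-only text node is dropped
print(repr(stripped_root[0].text))  # real text content is left untouched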

Example:

XML Input (test.xml)

<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>

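Python

from lxml import etree
from pprint import pprint

# Drop whitespace-only text nodes so that "empty" elements have text None
parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse('test.xml', parser=parser)

xpaths = []

for elem in tree.iter():
    # Replace the namespace-qualified tag with its local name (this modifies the tree)
    elem.tag = etree.QName(elem).localname
    xpaths.append((tree.getpath(elem), elem.text))

pprint(xpaths)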

Printed Output

[('/Lab_test_results', None),
 ('/Lab_test_results/name', None),
 ('/Lab_test_results/name/value', 'Lab test results'),
 ('/Lab_test_results/language', None),
 ('/Lab_test_results/language/terminology_id', None),
 ('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]

If you also need to collect attributes, you can make a few small changes:

for elem in tree.iter():
    elem.tag = etree.QName(elem).localname
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    for attr in elem.attrib:
        xpaths.append((f"{xpath}/@{attr}", elem.get(attr)))
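The question asked for paths relative to the root element (for example language/terminology_id/value). One possible variation, which is not part of the approach above, is to join the local names of each element's ancestors instead of calling .getpath(). This is only a sketch: local_name_paths is a hypothetical helper name, it assumes the element passed in is the document root, it strips surrounding whitespace from text, and it adds no positional indices, so repeated siblings produce identical paths.

from lxml import etree

XML = b"""<Lab_test_results xmlns="http://schemas.oceanehr.com/templates">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>"""


def local_name_paths(root):
    """Hypothetical helper: (path, text) pairs relative to the document root."""
    pairs = []
    # Passing etree.Element restricts iteration to elements (no comments or PIs)
    for elem in root.iter(etree.Element):
        if elem is root:
            continue
        # iterancestors() walks from the parent up to the document root
        chain = [elem, *elem.iterancestors()]
        names = [etree.QName(node).localname for node in reversed(chain)]
        # Treat missing or whitespace-only text as None
        text = elem.text.strip() if elem.text else ""
        # names[0] is the root element, which is left out of the relative path
        pairs.append(("/".join(names[1:]), text or None))
    return pairs


root = etree.fromstring(XML)
pairs = local_name_paths(root)
for pair in pairs:
    print(pair)

Unlike the loop above, this leaves the tags in the tree unchanged and does not need remove_blank_text, at the cost of producing plain name paths rather than XPath expressions that are guaranteed to select a single element.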
