How to get all XPaths from XML with just key names and no template URLs, with Python

I need to extract XPaths and values from an XML object. Currently I use lxml, which either gives long paths with repeated template URLs or just indices of XPath keys without names.

Question: How do I get XPaths with just names, without template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.

  1. with getelementpath(): has template URLs, and '\n\t\t' in empty keys.
>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]

[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_639-1'),
 ('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
  'xx'),
 ('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
  '\n\t\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_3166-1')]
  2. with getpath(): has neither key names nor URLs, and '\n\t\t' in empty keys.
>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]

[('/*/*[2]/*[1]/*', 'ISO_639-1'),
 ('/*/*[2]/*[2]', 'xx'),
 ('/*/*[3]', '\n\t\t'),
 ('/*/*[3]/*[1]', '\n\t\t\t'),
 ('/*/*[3]/*[1]/*', 'ISO_3166-1')]
  3. what I need: key names without URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...
[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]

This is the XML header:

<?xml version="1.0" ?>
<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>

I'd still use .getpath().

The reason you're getting * in your paths is that your XML has a default namespace. In XPath 1.0 an unprefixed name only matches elements in no namespace, so getpath() falls back to *, which gives a usable XPath without having to take the namespace into account.
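To see the effect in isolation, here is a minimal sketch that parses the same structure twice, once with a default namespace and once without, and compares what getpath() returns. The namespace URI "urn:example", the element names, and the helper all_paths() are placeholders made up for this illustration.

```python
from lxml import etree

# Same structure twice: once in a default namespace, once in no namespace.
# "urn:example" is a placeholder URI used only for this illustration.
with_ns = etree.fromstring('<root xmlns="urn:example"><a><b/></a></root>')
without_ns = etree.fromstring('<root><a><b/></a></root>')


def all_paths(root):
    """Return getpath() for every element under root, in document order."""
    tree = root.getroottree()
    return [tree.getpath(elem) for elem in root.iter()]


paths_with_ns = all_paths(with_ns)        # elements are reported as '*'
paths_without_ns = all_paths(without_ns)  # elements are reported by name

print(paths_with_ns)
print(paths_without_ns)
```

Only the namespace declaration differs between the two inputs, so the difference in the printed paths comes from the default namespace alone.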

To resolve this, first set the element name (.tag) to the local name (the element name without prefix or URI).

Also, you can create an XMLParser and set remove_blank_text to True to get rid of the entries that are only whitespace.

Example...

XML Input (test.xml)

<Lab_test_results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>

Python

from lxml import etree
from pprint import pprint

# Discard whitespace-only text between tags so empty elements report None
parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse('test.xml', parser=parser)

xpaths = []

for elem in tree.iter():
    # Replace the namespaced tag with its local name so getpath() uses the name
    elem.tag = etree.QName(elem).localname
    xpaths.append((tree.getpath(elem), elem.text))

pprint(xpaths)

Printed Output

[('/Lab_test_results', None),
 ('/Lab_test_results/name', None),
 ('/Lab_test_results/name/value', 'Lab test results'),
 ('/Lab_test_results/language', None),
 ('/Lab_test_results/language/terminology_id', None),
 ('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
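If you would rather not modify the parsed tree, or want paths relative to the root as in the question, one alternative is to build the paths yourself while walking the tree. The following is only a sketch using the standard library's xml.etree.ElementTree instead of lxml. The helper names local_name() and named_paths() are made up for this example, and repeated siblings are not given positional indices.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Lab_test_results xmlns="http://schemas.oceanehr.com/templates">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
    </language>
</Lab_test_results>"""


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an ElementTree tag."""
    return tag.rsplit('}', 1)[-1]


def named_paths(root):
    """Return (path, text) pairs for every descendant of root.

    Paths are relative to root and use local names only.
    Whitespace-only text is reported as None.
    """
    def walk(elem, prefix):
        for child in elem:
            name = local_name(child.tag)
            path = f"{prefix}/{name}" if prefix else name
            # Keep the text as-is unless it is missing or only whitespace
            text = child.text if child.text and child.text.strip() else None
            yield path, text
            yield from walk(child, path)

    return list(walk(root, ""))


result = named_paths(ET.fromstring(SAMPLE))
for path, text in result:
    print(path, repr(text))
```

Because the paths are assembled by hand, they are not guaranteed to be unique when an element has several children with the same name. Use the getpath() approach if you need paths that select exactly one node.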

If you also need to collect attributes, you can make a few small changes to the loop in the lxml example...

for elem in tree.iter():
    elem.tag = etree.QName(elem).localname
    xpath = tree.getpath(elem)
    xpaths.append((xpath, elem.text))
    for attr in elem.attrib:
        xpaths.append((f"{xpath}/@{attr}", elem.get(attr)))
