Parse XML in Python while ignoring entities

I have to parse an XML file and store its data in a database. The problem is that this XML contains some entities that I don't want to expand; instead, I want the raw entity reference. To clarify, I have the following schema:

<!ENTITY exa "example">
.....
<mytag>&exa;</mytag>

If I parse the above and read the tag "mytag" using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)
root = tree.getroot()

for item in root:
    if item.tag == "mytag":
        print(item.text)  # the entity has already been expanded here

I read the string "example". Instead, I want to get the entity name "exa". I guess it is possible, but since I'm new to Python development I can't find the right way to get this result. Any suggestions? Thank you.

Here is an example to start with:

import os
import re
from lxml import etree

xmlfile = 'testfile.xml'
xml_path = '%s/%s' % (os.path.dirname(os.path.realpath(__file__)), xmlfile)

parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(xml_path, parser)
# root = tree.getroot()

# '//mytag' matches <mytag> at any depth; '/mytag' would only match a root element named mytag
root = tree.xpath('//mytag')

for item in root:
    entity = etree.tostring(item, pretty_print=True).decode('utf-8')
    print('ENTITY     : ', entity)
    entity_value = re.findall(r'&(.*?);', entity)
    print('Parsed str : ', entity_value)

But there may be a simpler way to recover the value.
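
One possibly simpler route, sketched here under the assumption that lxml is installed: with resolve_entities=False, lxml keeps each unexpanded reference as a child node that can be selected with iter(etree.Entity), so no regular expression is needed. The sample document and the helper name entity_names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document mirroring the question's schema.
XML = b"""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY exa "example">
]>
<root><mytag>&exa;</mytag></root>
"""


def entity_names(xml_bytes, tag):
    """Return the names of the unexpanded entity references inside <tag> elements."""
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(xml_bytes, parser)
    names = []
    for item in root.iter(tag):
        # Unresolved references are kept as entity nodes below the element.
        for node in item.iter(etree.Entity):
            names.append(node.name)
    return names


print(entity_names(XML, "mytag"))
```

Each entity node also has a text attribute holding the reference as written in the document ("&exa;" for this sample), in case the raw form is wanted rather than the bare name.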

You can modify each of the ENTITY declarations in the XML file so that each entity's value is its own name, and then restore the original declarations at the end.
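
As a rough illustration of that idea on an in-memory string (no files involved), the sketch below rewrites every declaration of the form <!ENTITY name "value"> to <!ENTITY name "name"> before parsing. The sample document and the helper name neutralize_entities are assumptions made for this example, and the pattern only handles double-quoted declarations that fit on one line.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document mirroring the question's schema.
XML = """<!DOCTYPE root [
  <!ENTITY exa "example">
]>
<root><mytag>&exa;</mytag></root>
"""

# Matches <!ENTITY name "value"> and captures the entity name.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"[^"]*"\s*>')


def neutralize_entities(xml_text):
    """Replace each entity's value with the entity's own name."""
    return ENTITY_DECL.sub(r'<!ENTITY \1 "\1">', xml_text)


# Parsed as-is, the reference is expanded to the declared value.
print(ET.fromstring(XML).find("mytag").text)

# Parsed after the rewrite, the reference expands to its own name.
print(ET.fromstring(neutralize_entities(XML)).find("mytag").text)
```

Because the original string is never modified, nothing needs to be restored afterwards in this in-memory variant.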

You could create a class that clones your XML file:

import os
import re

class NoEntities:
    """
    Creates a clone of the target xml file such that the <!ENTITY x "y"> tags
    become <!ENTITY x "x">.
    """

    def __init__(self, xmlFile):
        self.targetName = xmlFile
        self.tmpName = 'temp.xml'

    def __enter__(self):
        match = r'<!ENTITY\s+(\S+)\s+"[^"]+"\s*>'
        replace = r'<!ENTITY \1 "\1">'

        with open(self.targetName) as target:
            with open(self.tmpName, 'w') as tmp:
                tmp.writelines(
                    re.sub(match, replace, line)
                    for line in target
                )

        return self.tmpName

    def __exit__(self, *exec_info):
        os.remove(self.tmpName)

And then use it inside a with block:

import xml.etree.ElementTree as ET

with NoEntities(pathToOriginalXml) as noEntityXml:
    tree = ET.parse(noEntityXml)
    # Do what you like...
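
If creating a temporary file is undesirable, a standard-library alternative may be the lower-level xml.parsers.expat module. This is only a sketch, not part of the approach above: assigning DefaultHandler (as opposed to DefaultHandlerExpand) turns off expansion of internally declared entities, and each reference is passed to that handler as written. The sample document and the helper name raw_text_by_tag are made up for illustration, and repeated tags are simply concatenated.

```python
import xml.parsers.expat

# Hypothetical sample document mirroring the question's schema.
XML = """<!DOCTYPE root [
  <!ENTITY exa "example">
]>
<root><mytag>&exa;</mytag></root>
"""


def raw_text_by_tag(xml_text):
    """Map each element name to its text, leaving entity references unexpanded."""
    result = {}
    stack = []  # names of the currently open elements

    def start(name, attrs):
        stack.append(name)
        result.setdefault(name, "")

    def end(name):
        stack.pop()

    def text(data):
        if stack:
            result[stack[-1]] += data

    def default(data):
        # Inside an element, an unexpanded reference arrives as "&name;".
        if stack and data.startswith("&") and data.endswith(";"):
            result[stack[-1]] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    # DefaultHandler (not DefaultHandlerExpand) inhibits entity expansion.
    parser.DefaultHandler = default
    parser.Parse(xml_text, True)
    return result


print(raw_text_by_tag(XML)["mytag"])
```

The trade-off is that this works on parser events rather than an element tree, so building a tree (if one is needed) is left to the caller.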
