lxml parsing with Python: how to use objectify

I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.

How can I convert the function below to return an objectify object? I would like to do this because an objectified XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.

def get_etree(path_file):

    from lxml import etree

    with open(path_file, 'r+') as f:
        xml_text = f.read()     
    recovering_parser = etree.XMLParser(recover=True)    
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)

    return xml

My failed attempt:

def get_etree(path_file):

    from lxml import etree, objectify

    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)   

    return xml

But I get this error:

lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI

The first and biggest mistake is to read a file into a string and feed that string to an XML parser.

Python will decode the file using whatever your default file encoding is (unless you specify an encoding when you call open()), and that step will very likely break anything other than plain ASCII files.

XML files come in many encodings; you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.

<?xml version="1.0" encoding="Windows-1252"?>

An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
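To illustrate the point about the declaration, here is a minimal sketch. It assumes Python 3 and a made-up Windows-1252 sample document (the `note` element and its content are invented for the demonstration). The parser is handed the file path and picks the right codec by itself, while decoding by hand with a guessed codec damages the text.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample: a Windows-1252 document with one non-ASCII character.
xml_bytes = (
    '<?xml version="1.0" encoding="Windows-1252"?>\n'
    '<note>caf\u00e9</note>'
).encode("cp1252")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    sample_path = tmp.name

try:
    # The parser reads raw bytes and honours the declared encoding.
    parsed_text = etree.parse(sample_path).getroot().text

    # Decoding by hand with a guessed codec (UTF-8 here) corrupts the text.
    with open(sample_path, encoding="utf-8", errors="replace") as f:
        guessed_text = f.read()
finally:
    os.remove(sample_path)

print(parsed_text == "caf\u00e9")   # the parser recovered the original text
print("caf\u00e9" in guessed_text)  # the hand-decoded string did not
```

Without `errors="replace"` the manual read would fail outright with a `UnicodeDecodeError`, which is arguably the better of the two outcomes.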

Luckily lxml makes it very easy:

from lxml import etree, objectify

def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))

def get_objectify(path_file):
    return objectify.parse(path_file)

and

path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)

print(xml1)   # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2)   # -> <lxml.etree._ElementTree object at 0x02A7B878>
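Both functions return an `_ElementTree`, so the objectify-style access the question asks about starts at the root element. The following is a rough sketch only: the `survey`, `heading`, and `question` element names are made up for illustration and have nothing to do with the real SPSS metadata layout.

```python
from io import BytesIO

from lxml import objectify

# Hypothetical sample; the element names are invented for this sketch.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<survey>
  <heading>Customer feedback</heading>
  <question id="q1">How satisfied are you?</question>
  <question id="q2">Would you recommend us?</question>
</survey>"""

tree = objectify.parse(BytesIO(sample))  # still an _ElementTree
root = tree.getroot()                    # an objectify.ObjectifiedElement

# Child elements are reachable as plain Python attributes.
heading = root.heading.text

# Iterating over a child walks all siblings that share its tag.
question_ids = [q.get("id") for q in root.question]

print(heading)
print(question_ids)
```

Note that the sample is passed as bytes through a file-like object, so the parser still gets to read the XML declaration itself.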

PS: Think hard about whether you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway, or would you much rather reject it and display an error message?

I would do the latter. Using a recovering parser may cause nasty run-time errors later.
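As a sketch of the "reject it and display an error message" approach, assuming a deliberately broken sample string and a hypothetical helper named `parse_strict` (neither comes from the question):

```python
from lxml import etree

# Hypothetical broken input: the <b> element is never closed.
broken = b"<root><a>1</a><b>2</root>"


def parse_strict(data):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return etree.fromstring(data), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


root, error = parse_strict(broken)
if error is not None:
    print("Rejected:", error)

# For comparison: a recovering parser hands back a tree anyway, but its
# shape is the parser's best guess, not something the file guarantees.
recovered = etree.fromstring(broken, parser=etree.XMLParser(recover=True))
print(etree.tostring(recovered))
```

With the strict variant the failure shows up at the point where the file is read, rather than later when some code trips over a missing or misplaced element.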

