
Python: Avoid DTD validation with LXML

I am parsing USPTO patents from 2001 in SGML format. At the top of each file, an external DTD is referenced. Unfortunately, this DTD seems to be invalid. A validity check confirms that:

Line 361
Error: A '(' character or an element type is required within declaration of element type "ADR".
<!ELEMENT ADR  - - (OMC?,STR*,CITY?,CNTY?,STATE?,CTRY?,PCODE?,EAD*,TEL*,FAX* ...

However, I do not need to validate the SGML files being processed. I just need the parser to be aware of the entities. Currently, I am using Python with the LXML library. I call the XMLParser as follows:

parser = etree.XMLParser(target=SimpleXMLHandler(), resolve_entities=False, load_dtd=dtd, dtd_validation=False, recover=True)  

But I still immediately get the error that the external DTD is invalid at line 361. How can I avoid that issue? I did not write the DTD, so I would rather not repair it.

Regards!

As Chrono Kitsune already noted, the problem lies in XML versus SGML: the DTD is not a valid XML DTD because it is an SGML DTD. The `- -` after the element name in `<!ELEMENT ADR  - - (...)>` is SGML tag-omission notation, which XML DTD syntax does not allow. That is why an XML parser rejects the declaration at line 361, whether or not validation is enabled.

I'd suggest converting the SGML documents to XML first, for example using sx.
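As a rough sketch of that conversion step, the snippet below shells out to the converter and captures the XML it writes to standard output. It assumes the tool is installed and on the PATH under the name `osx` (the name OpenSP ships it under; other installations call it `sx`), and that it can locate the DTD referenced by the document. The helper names `build_sx_command` and `sgml_to_xml` are hypothetical, chosen only for this example.

```python
import shutil
import subprocess


def build_sx_command(sgml_path, executable="osx"):
    """Return the argument list for converting one SGML file.

    Assumption: the converter takes the input file as its only
    argument and writes the resulting XML to standard output.
    """
    return [executable, sgml_path]


def sgml_to_xml(sgml_path, executable="osx"):
    """Run the converter on sgml_path and return the XML as bytes."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")

    # The exit status is deliberately not treated as fatal here, on the
    # assumption that the converter may report recoverable markup
    # errors while still producing usable output.
    result = subprocess.run(
        build_sx_command(sgml_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    if not result.stdout:
        # Nothing was produced, so surface the converter's diagnostics.
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout
```

The returned bytes are plain XML, so they could then be handed to lxml (for instance with `etree.fromstring`) without loading the SGML DTD at all. Whether the entities come out the way you need depends on the converter's options, which are worth checking in its documentation.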

