Python: Avoid DTD validation with LXML
I am parsing USPTO patents from 2001 in SGML format. At the top of each file, an external DTD is referenced. Unfortunately, this DTD seems to be invalid. A validity check confirms that:
Line 361
Error: A '(' character or an element type is required within declaration of element type "ADR".
<!ELEMENT ADR - - (OMC?,STR*,CITY?,CNTY?,STATE?,CTRY?,PCODE?,EAD*,TEL*,FAX* ...
However, I do not need to validate the SGML files being processed. I just need the SGML parser to be aware of the entities. Currently, I am using Python with the LXML library. I call the XMLParser as follows:
parser = etree.XMLParser(target=SimpleXMLHandler(), resolve_entities=False, load_dtd=dtd, dtd_validation=False, recover=True)
Still, I immediately get the error that the external DTD is invalid at line 361. How can I avoid that issue? I did not write the DTD, so I do not want to repair it.
Regards!
As Chrono Kitsune already noted, the problem is XML versus SGML: the DTD is not a valid XML DTD because it is an SGML DTD. The "- -" after the element name in the reported line are SGML tag-omission indicators, which XML DTD syntax does not allow.
I'd suggest converting the SGML documents to XML first, for example using sx.
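The conversion step could be driven from Python roughly as follows. This is only a sketch under assumptions: the converter is installed and on the PATH as "osx" (the OpenSP build of sx; SP itself ships it as "sx"), it can locate the DTD referenced by each file, and it writes the XML document to standard output. The helper names sx_command and sgml_to_xml are made up for illustration.

```python
import subprocess


def sx_command(sgml_path, executable="osx"):
    """Build the command line that converts one SGML file to XML.

    Assumption: the converter is on PATH as "osx" (OpenSP) or "sx" (SP).
    """
    return [executable, str(sgml_path)]


def sgml_to_xml(sgml_path, executable="osx"):
    """Run the converter and return the XML bytes it writes to stdout."""
    result = subprocess.run(
        sx_command(sgml_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # The converter may print diagnostics and exit with a non-zero status
    # while still producing output, so only empty output is treated as failure.
    if not result.stdout:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout
```

The returned bytes could then be passed to the existing lxml parser, for example with etree.fromstring(xml_bytes, parser), where parser is built without a reference to the SGML DTD.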