XML validation (DTD) using lxml (Python)
There is a brief explanation of validation based on XML here. I am trying to parse an XML file that refers to nested DTDs, i.e. the XML file refers to a DTD, which in turn refers to other DTDs.
The error I get is: Namespace prefix SomeNameSpace on Config is not defined. All I am trying to do is parse the XML using etree.parse, which is part of the lxml API.
My question is:
I have seen similar questions, but nothing that answers this one.
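For reference, here is a minimal sketch of the kind of parse being attempted. The real files are not shown above, so config.xml, config.dtd and their contents are hypothetical stand-ins, and the DTD here is a single flat file rather than a nested one. It assumes the parser should load and validate against whatever DTD the DOCTYPE names, which is what the dtd_validation option of etree.XMLParser asks for.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-ins for the real files, which are not shown in the question.
DTD_TEXT = "<!ELEMENT config (item*)>\n<!ELEMENT item (#PCDATA)>\n"
XML_TEXT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE config SYSTEM "config.dtd">\n'
    "<config><item>one</item></config>\n"
)


def parse_with_dtd(xml_path):
    """Parse xml_path, validating against the DTD named in its DOCTYPE."""
    # dtd_validation=True makes the parser load the referenced DTD and
    # raise etree.XMLSyntaxError if the document does not conform to it.
    parser = etree.XMLParser(dtd_validation=True)
    return etree.parse(xml_path, parser)


def demo():
    """Write the stand-in files to a temporary directory and parse them."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "config.dtd"), "w", encoding="utf-8") as f:
            f.write(DTD_TEXT)
        xml_path = os.path.join(tmp, "config.xml")
        with open(xml_path, "w", encoding="utf-8") as f:
            f.write(XML_TEXT)
        # The relative SYSTEM identifier is resolved against the XML file's location.
        tree = parse_with_dtd(xml_path)
        return tree.getroot().tag


if __name__ == "__main__":
    print(demo())
```

Parsing from a file path rather than from a string matters here, because the parser needs a base location to resolve the relative DTD reference.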
A while back I tried to do something similar and wasn't able to find a solution. I finally wrote the script below, which opens the XML file and looks for a DTD using a regex. It also accepts a DTD path on the command line as an override, which was a requirement I had.
If lxml handles nested DTDs, then the code below should work for you.
To be honest, I thought it was a bit of a hack to read the file myself, but it was the only way I found.
import re
import sys
import os.path
from lxml import etree


def main(args):
    if len(args) < 1:
        print("Not enough arguments given. Expected:")
        print("\tvalidatexml <xml file name> [<dtd file name>]\n")
        sys.exit(1)

    # Matches a DOCTYPE declaration written on a single line and captures the .dtd path.
    dtdRe = re.compile(r'.*<!DOCTYPE .* ["\'](.*\.dtd)["\']>.*')
    theDtd = None
    inFile = args[0]
    fdir = os.path.abspath(os.path.dirname(inFile))

    if len(args) == 2:
        # A DTD given on the command line overrides the one named in the file.
        theDtd = os.path.abspath(args[1])
    else:
        with open(inFile, 'r', encoding='utf-8') as inf:
            for ln in inf:
                mtch = dtdRe.match(ln)
                if mtch:
                    if os.path.isabs(mtch.group(1)):
                        theDtd = mtch.group(1)
                    else:
                        # Relative DTD paths are resolved against the XML file's directory.
                        theDtd = os.path.abspath(os.path.join(fdir, mtch.group(1)))
                    break

    if theDtd is None:
        print("No DTD specified!")
        sys.exit(2)
    if not os.path.exists(theDtd):
        print("The DTD ({}) does not exist!".format(theDtd))
        sys.exit(3)
    print('Using DTD:', theDtd)

    # Pass the path rather than an open file object, so no file handle is left unclosed.
    dtd = etree.DTD(theDtd)
    # The document is parsed without validation; the DTD object does the validating.
    tree = etree.parse(inFile)
    valid = dtd.validate(tree)
    if valid:
        print("XML was valid!")
    else:
        print("XML was not valid:")
        print(dtd.error_log.filter_from_errors())


if __name__ == '__main__':
    main(sys.argv[1:])
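The validation step at the end of the script can also be tried on its own, without any files. The sketch below is only an illustration: the two-element DTD and the check helper are made up for this example and are not part of the script above. It builds an etree.DTD from an in-memory string, validates a parsed document against it, and collects the messages from error_log when validation fails.

```python
import io

from lxml import etree

# A tiny made-up DTD: element "a" must contain exactly one empty element "b".
dtd = etree.DTD(io.StringIO("<!ELEMENT a (b)>\n<!ELEMENT b EMPTY>"))


def check(xml_text):
    """Return (is_valid, error_messages) for xml_text against the DTD above."""
    root = etree.fromstring(xml_text)
    ok = dtd.validate(root)
    # Only read the log after a failed validation.
    errors = [] if ok else [e.message for e in dtd.error_log.filter_from_errors()]
    return ok, errors


if __name__ == "__main__":
    print(check("<a><b/></a>"))
    print(check("<a/>"))
```

Keeping parsing and validation separate like this means a document that is well-formed but invalid still yields a tree, and the problems are reported through the log rather than as a parse exception.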
Can you try parsing it with Beautiful Soup? Does the error still occur?