XML validation (DTD) using lxml (Python)

Question

There is a brief explanation of XML-based validation here. I am trying to parse an XML file that refers to nested DTDs, i.e. the XML file refers to a DTD, which in turn refers to other DTDs.

The error I get is: Namespace prefix SomeNameSpace on Config is not defined. All I am trying to do is parse the XML using etree.parse, which is part of the lxml API. My questions are:

Can I just turn off validation (I am assuming the XML is correct)?
How exactly can I provide lxml with all the nested DTDs, so that it doesn't complain about any of the tags?

I see similar questions, but nothing that answers this one.

Answer 1

A while back I tried to do something similar and wasn't able to find a solution. I finally wrote the script below, which opens the XML file and looks for a DTD reference using a regex. It also accepts the DTD path as an optional second command-line argument, which overrides the one found in the file; that was a requirement I had.

If lxml handles nested DTDs, then the code below should work for you.

To be honest, I thought reading the file myself was a bit of a hack, but it was the only way I found.

import re
import sys
import os.path
import codecs
from lxml import etree

def main(args):
    if len(args) < 1:
        print("Not enough arguments given.  Expected:")
        print("\tvalidatexml <xml file name> [<dtd file name>]\n")
        sys.exit(1)

    # Raw string so that \. reaches the regex engine as a literal dot.
    dtdRe = re.compile(r'.*<!DOCTYPE .* ["\'](.*\.dtd)["\']>.*')
    theDtd = None
    inFile = args[0]
    fdir = os.path.abspath(os.path.dirname(inFile))
    if len(args)==2:
        theDtd = os.path.abspath(args[1])
    else:
        with codecs.open(args[0], 'r', 'utf-8') as inf:
            for ln in inf:
                mtch = dtdRe.match(ln)
                if mtch:
                    if os.path.isabs(mtch.group(1)):
                        theDtd = mtch.group(1)
                    else:
                        theDtd = os.path.abspath(fdir + '/' + mtch.group(1))
                    break
    if theDtd is None:
        print("No DTD specified!")
        sys.exit(2)

    if not os.path.exists(theDtd):
        print("The DTD ({}) does not exist!".format(theDtd))
        sys.exit(3)

    print('Using DTD:', theDtd)

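    # Pass the path rather than an open file: no handle is leaked, and libxml2
    # gets a base location for any relative references inside the DTD.
    dtd = etree.DTD(theDtd)
    tree = etree.parse(inFile)

    # Validate the parsed tree against the DTD object.
    valid = dtd.validate(tree)
    if valid:
        print("XML was valid!")
    else:
        print("XML was not valid:")
        print(dtd.error_log.filter_from_errors())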


if __name__ == '__main__':
    main(sys.argv[1:])
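
Whether lxml follows a DTD that refers to another DTD can be checked with a small experiment. The sketch below is not part of the script above: the file names outer.dtd and inner.dtd, their contents, and the helper validate_with_nested_dtd are all made up for illustration. It assumes the outer DTD pulls in the inner one through an external parameter entity, and that etree.DTD is given a file name so the relative reference can be resolved.

```python
import os
import tempfile
from lxml import etree

# Hypothetical DTD contents, invented for this experiment.
OUTER_DTD = (
    '<!ELEMENT config (item+)>\n'
    '<!ENTITY % inner SYSTEM "inner.dtd">\n'
    '%inner;\n'
)
INNER_DTD = '<!ELEMENT item (#PCDATA)>\n'


def validate_with_nested_dtd(xml_bytes):
    """Write both DTDs to a temp dir and validate xml_bytes against the outer one."""
    with tempfile.TemporaryDirectory() as tmp:
        outer_path = os.path.join(tmp, 'outer.dtd')
        with open(outer_path, 'w', encoding='utf-8') as f:
            f.write(OUTER_DTD)
        with open(os.path.join(tmp, 'inner.dtd'), 'w', encoding='utf-8') as f:
            f.write(INNER_DTD)
        # A file name (not an open file object) gives the parser a base
        # location for the relative reference to inner.dtd.
        dtd = etree.DTD(outer_path)
        root = etree.fromstring(xml_bytes)
        return dtd.validate(root)


# <item> is declared only in inner.dtd, so this passes only if it was loaded.
good = validate_with_nested_dtd(b'<config><item>a</item></config>')
# <other> is declared in neither DTD.
bad = validate_with_nested_dtd(b'<config><other/></config>')
print(good, bad)
```

If the first document validates, the nested DTD was resolved; if it does not, dtd.error_log should say which declaration was missing.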

Answer 2

Can you try parsing it with Beautiful Soup? Does the error still occur?
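
Staying within lxml, it may also help to separate the two issues in the question. By default etree does not load or validate against a DTD, and the "Namespace prefix ... is not defined" message is a parse error rather than a DTD validation error. The sketch below reproduces it with a made-up sample document and shows one possible workaround, the recover option of etree.XMLParser. The helper names parse_strict and parse_lenient are hypothetical.

```python
from lxml import etree

# Made-up sample: the prefix "ns" is used but never declared, and the
# DOCTYPE points at a DTD file that does not exist.
SAMPLE = b'<!DOCTYPE Config SYSTEM "missing.dtd"><ns:Config><item/></ns:Config>'


def parse_strict(data):
    """Parse with lxml's default parser (no DTD loading, no validation)."""
    return etree.fromstring(data)


def parse_lenient(data):
    """Parse with recover=True so the parser tries to continue past errors."""
    parser = etree.XMLParser(recover=True, dtd_validation=False)
    return etree.fromstring(data, parser)


try:
    parse_strict(SAMPLE)
    strict_error = None
except etree.XMLSyntaxError as exc:
    # The missing DTD is not the problem here; the undeclared prefix is.
    strict_error = str(exc)

root = parse_lenient(SAMPLE)
print(strict_error)
print(root.tag)
```

A recovered tree still has no namespace bound to the prefix, so namespace-aware lookups will not work on it. If the real document gets its xmlns declaration from default attributes in the DTD, which is only a guess about the asker's files, then etree.XMLParser(load_dtd=True, attribute_defaults=True) may be the more faithful fix.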

Answer 1 1 2013-06-17 22:44:00

Answer 2 0 2013-06-17 22:32:40
