Reading XML DOCTYPE info with Python

Question

I need to read the version from an XML file that begins as follows.

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE twReport [ 
<!ELEMENT twReport (twHead?, (twWarn | twDebug | twInfo)*, twBody, twSum?, 
               twDebug*, twFoot?, twClientInfo?)> 
<!ATTLIST twReport version CDATA "10,4"> <----- VERSION INFO HERE

I use xml.dom.minidom to parse the XML file, and I need to read the version that is declared in the embedded DTD.

Can I use xml.dom.minidom for this purpose?
Is there a Python XML parser that can do this?

Answer 1

How about xmlproc's DTD API?

Here's a random snippet of code I wrote years and years ago to do some work with DTDs from Python, which might give you an idea of what it's like to work with this library:

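# xmlproc ships with the PyXML package, which targets Python 2.
from xml.parsers.xmlproc import dtdparser

attr_separator = '_'
child_separator = '_'

# Parse the standalone DTD file into a DTD object.
dtd = dtdparser.load_dtd('schedule.dtd')

for name, element in dtd.elems.items():
    # One line per attribute declared on this element.
    for attr in element.attrlist:
        print('%s%s%s = ' % (name, attr_separator, attr))
    # One line per child element allowed at the start of the content model.
    for child in element.get_valid_elements(element.get_start_state()):
        print('%s%s%s = ' % (name, child_separator, child))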

(FYI, this was the first result when searching for "python dtd parser".)
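
If installing xmlproc is not an option, the standard library's xml.parsers.expat can also report ATTLIST declarations from an internal DTD subset through its AttlistDeclHandler callback. The following is only a minimal sketch: SAMPLE is a made-up document modelled on the one in the question (with a simplified content model so that it is complete), and attlist_defaults is a hypothetical helper name.

```python
import xml.parsers.expat

# Hypothetical sample modelled on the question's document; the content
# model is simplified so that the document is complete and well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE twReport [
<!ELEMENT twReport (#PCDATA)>
<!ATTLIST twReport version CDATA "10,4">
]>
<twReport/>
"""


def attlist_defaults(xml_bytes):
    """Return {(element, attribute): default} for each ATTLIST declaration."""
    defaults = {}

    def on_attlist(elname, attname, type_, default, required):
        # default is None when the declaration carries no default value.
        defaults[(elname, attname)] = default

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_bytes, True)
    return defaults


print(attlist_defaults(SAMPLE)[("twReport", "version")])
```

The handler fires once per declared attribute, so the same function should also collect defaults for other elements if the DTD declares any.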

Answer 2

Because both standard-library XML modules (xml.dom.minidom and xml.etree) use the same parser (xml.parsers.expat), you are limited in the "quality" of XML data you can parse successfully.

You're better off using one of the tried-and-true third-party modules such as lxml or BeautifulSoup, which are not only more resilient to errors but will also give you what you are looking for with little trouble.


solution1
2 ACCEPTED 2010-01-27 15:53:47

solution2
0 2010-01-28 14:10:46
