Parsing XML files and validating against xsd schema

This is an example of XML output I need to parse and validate against the schema xsd files.

<Record_Delimiter DocumentID="1.1" DocumentType="PARENT" DocumentName="SCHOOL" RelatedDocumentID=""/>
<xs:SCHOOL>
    <xs:Name>some name</xs:Name>
    <xs:ID>5908390481</xs:ID>
    <xs:Address>some address</xs:Address>
</xs:SCHOOL>
<Record_Delimiter DocumentID="1.2" DocumentType="CHILD" DocumentName="STUDENTEXP" RelatedDocumentID="1.1"/>
<xs:STUDENTEXP>
    <xs:STUDENT>
        <xs:Name>some name</xs:Name>
        <xs:SID>s1036456</xs:SID>
        <xs:Age>12</xs:Age>
        <xs:Address>some address</xs:Address>
        <xs:Expenses>
            <xs:Fees>800</xs:Fees>
            <xs:Books>100</xs:Books>
            <xs:Uniform>50</xs:Uniform>
            <xs:Transport>10</xs:Transport>
            </xs:Expenses>
    </xs:STUDENT>
</xs:STUDENTEXP>
<Record_Delimiter DocumentID="1.3" DocumentType="CHILD" DocumentName="STUDENTEXP" RelatedDocumentID="1.1"/>
<xs:STUDENTEXP>
    <xs:STUDENT>
        <xs:Name>some name</xs:Name>
        <xs:SID>s1036789</xs:SID>
        <xs:Age>15</xs:Age>
        <xs:Address>some address</xs:Address>
        <xs:Expenses>
            <xs:Fees>1000</xs:Fees>
            <xs:Books>200</xs:Books>
            <xs:Uniform>50</xs:Uniform>
            <xs:Transport>10</xs:Transport>
        </xs:Expenses>
    </xs:STUDENT>
</xs:STUDENTEXP>

The file as a whole is not well-formed XML because no single root element wraps all the other tags. Each record (i.e., SCHOOL and STUDENTEXP) is valid XML on its own, though, and validates against its schema (school.xsd, studentexp.xsd).

I have never worked with this format and am not sure about a few things, such as how to parse a file like this programmatically. Normally, using lxml, we could validate each record if it were in a separate file:

from lxml import etree

xmlschema = etree.XMLSchema(etree.parse('./studentexp.xsd'))
xmlschema.assertValid(etree.parse('./sampleStudentexp.xml'))  # raises DocumentInvalid on failure

What is the proper way to extract the "records" and validate them separately?

This question has been asked before: Parse a xml file with multiple root element in python

I suspect there is a single-pass solution that involves a stream parser, but my Python isn't strong enough to work out whether it's possible. In any case, one of the solutions in that thread might be good enough.
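
As a rough illustration of the "good enough" route, a common workaround for multi-root input is to wrap the text in a synthetic root element and parse it in one go; whether this matches the answers in that thread is not checked here. The sketch below uses only the standard library. The names `iter_records`, `wrapper`, and the namespace URI `urn:example:school` are assumptions made for the example, since the question does not show what the `xs` prefix is bound to.

```python
import xml.etree.ElementTree as ET

# Assumption: the real namespace URI for the "xs" prefix is not shown in the
# question, so a placeholder URI is declared on the synthetic root.
XS_NS = "urn:example:school"


def iter_records(text, ns_uri=XS_NS):
    """Yield (delimiter_attributes, record_element) pairs from multi-root text."""
    # Wrap the fragments in one root element so the text becomes well-formed.
    wrapped = f'<wrapper xmlns:xs="{ns_uri}">{text}</wrapper>'
    root = ET.fromstring(wrapped)
    delimiter = None
    for child in root:
        if child.tag == "Record_Delimiter":
            # Remember the metadata that describes the next record.
            delimiter = dict(child.attrib)
        else:
            yield delimiter, child
            delimiter = None


sample = """
<Record_Delimiter DocumentID="1.1" DocumentType="PARENT" DocumentName="SCHOOL" RelatedDocumentID=""/>
<xs:SCHOOL>
    <xs:Name>some name</xs:Name>
</xs:SCHOOL>
<Record_Delimiter DocumentID="1.2" DocumentType="CHILD" DocumentName="STUDENTEXP" RelatedDocumentID="1.1"/>
<xs:STUDENTEXP>
    <xs:STUDENT><xs:SID>s1036456</xs:SID></xs:STUDENT>
</xs:STUDENTEXP>
"""

records = list(iter_records(sample))
for meta, element in records:
    print(meta["DocumentID"], meta["DocumentName"], element.tag)
```

This reads the whole file into memory and would break if the text begins with an XML declaration, so it is probably only suitable for modest file sizes.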

lxml supports event-driven parsing keyed on element start and end tags (see the "incremental event parsing" section of its documentation). The following worked:

from collections import deque

from lxml import etree

parser = etree.XMLPullParser(events=('start', 'end'))
events = parser.read_events()

with open('./sample.xml', 'rb') as f:
    d1 = deque()  # stack of tags that are currently open
    for line in f:
        parser.feed(line)
        for action, e in events:
            if action == 'start':
                d1.append(e.tag)
            elif action == 'end' and len(d1) == 1:
                # closing the outermost open element ends one record
                if d1.pop() == e.tag:
                    root = parser.close()  # finish this record and get its root element
                    print(etree.tostring(root, pretty_print=True, encoding="UTF-8").decode("UTF-8"))
            else:
                d1.pop()
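
The loop above prints each record, but validating one also requires knowing which schema applies to it, and that information is in the preceding Record_Delimiter. A simpler single-pass sketch, assuming every Record_Delimiter sits on its own line as in the sample, is to split the text on those lines and keep the DocumentName alongside each record. The function name `split_records` is hypothetical, and only the standard library is used.

```python
import io
import re

# Matches the DocumentName attribute on a Record_Delimiter line.
_NAME_RE = re.compile(r'<Record_Delimiter\b[^>]*\bDocumentName="([^"]*)"')


def split_records(lines):
    """Yield (document_name, record_text) for each record in an iterable of text lines."""
    name, chunk = None, []
    for line in lines:
        match = _NAME_RE.search(line)
        if match:
            # A new delimiter closes the previous record, if there was one.
            if name is not None and chunk:
                yield name, "".join(chunk)
            name, chunk = match.group(1), []
        elif line.strip():
            chunk.append(line)
    # Emit the last record once the input is exhausted.
    if name is not None and chunk:
        yield name, "".join(chunk)


sample = io.StringIO(
    '<Record_Delimiter DocumentID="1.1" DocumentType="PARENT" DocumentName="SCHOOL" RelatedDocumentID=""/>\n'
    '<xs:SCHOOL>\n'
    '    <xs:ID>5908390481</xs:ID>\n'
    '</xs:SCHOOL>\n'
    '<Record_Delimiter DocumentID="1.2" DocumentType="CHILD" DocumentName="STUDENTEXP" RelatedDocumentID="1.1"/>\n'
    '<xs:STUDENTEXP>\n'
    '    <xs:STUDENT><xs:Age>12</xs:Age></xs:STUDENT>\n'
    '</xs:STUDENTEXP>\n'
)

records = list(split_records(sample))
for name, text in records:
    print(name, len(text.splitlines()))
```

Each `record_text` could then be parsed with `etree.fromstring` and checked with the `etree.XMLSchema` built for that name (school.xsd for SCHOOL, studentexp.xsd for STUDENTEXP), provided the real records declare the `xs` namespace themselves.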
