How to parse xml-file with directory structure

I've got an xml-file containing the directory structure for files I want to put into a tar.gz file (flattened).

How should I parse the xml to extract the path for each file?

Right now I'm using lxml and finding the paths like this:

paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))

But this feels too hardcoded, and it will not work well if the structure changes.

I can find each file-node with root.iter('file'), but how can I get all parents/directories for each node/file? Or should I do this a (completely?) different way?

The xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </result>
        </language>
    </case>
</files>

And these are the files:

regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png
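
One way to get the parents of each file node without hardcoding the tag names is to walk upwards from the node. With lxml, every element has getparent() and iterancestors() for this. The sketch below uses only the standard library instead, where elements do not know their parent, so it builds a child-to-parent map first. The helper name file_paths and the shortened SAMPLE string are made up for illustration, and the code assumes that the root stores its name in batch while every other level uses id, as in the xml above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the xml from the question (illustrative only).
SAMPLE = """\
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
            </result>
        </language>
    </case>
</files>
"""


def file_paths(root):
    """Return one 'dir/dir/.../file' path per <file> element under root."""
    # ElementTree elements do not store their parent, so map child -> parent once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    paths = []
    for file_node in root.iter('file'):
        parts = []
        node = file_node
        while node is not None:
            # Assumption: the root names itself via 'batch', all other levels via 'id'.
            parts.append(node.get('id') or node.get('batch'))
            node = parent_of.get(node)  # None once the root has been handled
        paths.append('/'.join(reversed(parts)))
    return paths


paths = file_paths(ET.fromstring(SAMPLE))
for path in paths:
    print(path)
```

Because the loop never names case, language or result, adding or removing a level in the xml should not require changing the code.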

Do you define this file schema yourself? If you can change it, I definitely would. Try something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>

Always give tags the same name if they have the same meaning. Maybe distinguish the elements by their attributes rather than by different tag names; it would make your parsing easier.

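import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
root = tree.getroot()
# iter('file') finds every <file> element, no matter how deeply it is nested.
for file in root.iter('file'):
    # Note: the directory prefix is hardcoded, so this only illustrates the iteration.
    print('regular/case_10_some_description/english/images/' + file.attrib['id'])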
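
To avoid the hardcoded prefix, the path can be carried down while walking the tree. This is only a rough sketch, assuming the uniform Directory layout suggested above, where every level (including the root) has an id attribute. The generator name iter_file_paths and the shortened DIRECTORY_XML string are hypothetical and just there to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Shortened version of the suggested <Directory> layout (illustrative only).
DIRECTORY_XML = """\
<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>
"""


def iter_file_paths(node, prefix=()):
    """Yield the full path of every <file> below node, depth first."""
    # Assumption: every element, directory or file, carries its name in 'id'.
    here = prefix + (node.get('id'),)
    if node.tag == 'file':
        yield '/'.join(here)
        return
    for child in node:
        # Recurse with the path collected so far as the new prefix.
        yield from iter_file_paths(child, here)


directory_paths = list(iter_file_paths(ET.fromstring(DIRECTORY_XML)))
for path in directory_paths:
    print(path)
```

The recursion does not care how many Directory levels there are, so the same code keeps working if the nesting depth changes.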
