I've got an XML file containing the directory structure for files I want to put into a tar.gz file (flattened).
How should I parse the xml to extract the path for each file?
Right now I'm using lxml and finding the paths like this:
paths = []
for case in root.iter('case'):
    for language in case.iter('language'):
        for result in language.iter('result'):
            for file in result.iter('file'):
                paths.append('/'.join([node.get('id') for node in [case, language, result, file]]))
But this feels a bit too hardcoded, and it does not work well if the structure changes.
I can find each file node with root.iter('file'), but how can I get all the parents/directories for each node/file? Or should I do this in a (completely?) different way?
The xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<files batch="regular">
    <case id="case_10_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </result>
        </language>
    </case>
    <case id="case_12_some_description">
        <language id="english">
            <result id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </result>
        </language>
    </case>
</files>
And these are the files:
regular/case_10_some_description/english/images/screenshot_1.png
regular/case_10_some_description/english/images/screenshot_2.png
regular/case_10_some_description/english/images/screenshot_3.png
regular/case_10_some_description/english/images/screenshot_4.png
regular/case_10_some_description/english/images/screenshot_5.png
regular/case_10_some_description/english/images/screenshot_6.png
regular/case_12_some_description/english/images/screenshot_1.png
regular/case_12_some_description/english/images/screenshot_2.png
regular/case_12_some_description/english/images/screenshot_3.png
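To get the parents of each file node without hardcoding the tag names, one option is to walk upwards from every `file` element. With lxml, elements offer `getparent()` and `iterancestors()` for this. The sketch below uses the standard library's `xml.etree.ElementTree` instead, which has no parent pointers, so it builds a child-to-parent map first. The helper name `file_paths` and the shortened inline sample are illustrative, and the code assumes each directory name sits in the `id` attribute, except on the root, where it is in `batch`.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, inlined so the sketch runs on its own.
SAMPLE = """<files batch="regular">
  <case id="case_10_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
      </result>
    </language>
  </case>
  <case id="case_12_some_description">
    <language id="english">
      <result id="images">
        <file id="screenshot_1.png"/>
      </result>
    </language>
  </case>
</files>"""


def file_paths(root):
    """Return one slash-joined path per <file> element, built from its ancestors."""
    # ElementTree elements have no parent pointer, so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    paths = []
    for node in root.iter('file'):
        parts = []
        while node is not None:
            # Assumption: the name lives in 'id', except on the root, which uses 'batch'.
            parts.append(node.get('id') or node.get('batch'))
            node = parent_of.get(node)
        paths.append('/'.join(reversed(parts)))
    return paths


paths = file_paths(ET.fromstring(SAMPLE))
for path in paths:
    print(path)
```

Because the walk never mentions `case`, `language` or `result`, adding or removing a level in the XML should not require changing the code.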
Did you create this file schema yourself? If you can change it, I definitely would. Try something like this:
<?xml version="1.0" encoding="UTF-8"?>
<Directory id="regular">
    <Directory id="case_10_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
                <file id="screenshot_4.png"/>
                <file id="screenshot_5.png"/>
                <file id="screenshot_6.png"/>
            </Directory>
        </Directory>
    </Directory>
    <Directory id="case_12_some_description">
        <Directory id="english">
            <Directory id="images">
                <file id="screenshot_1.png"/>
                <file id="screenshot_2.png"/>
                <file id="screenshot_3.png"/>
            </Directory>
        </Directory>
    </Directory>
</Directory>
Always give tags the same name if they have the same meaning. Consider expressing the differences through attributes rather than through different tag names; it would make your parsing easier:
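import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
root = tree.getroot()
# Note: the directory prefix is hardcoded, so only files under that one directory get the right path.
for file in root.iter('file'):
    print('regular/case_10_some_description/english/images/' + file.attrib['id'])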
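With the uniform `Directory` schema, the full path of each file can also be collected on the way down, with no hardcoded prefix. The following is a minimal sketch, not a definitive implementation: the generator name `iter_paths` and the shortened inline sample are illustrative, and it assumes every element carries its name in the `id` attribute.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the suggested schema, inlined so the sketch runs on its own.
SAMPLE = """<Directory id="regular">
  <Directory id="case_10_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
        <file id="screenshot_2.png"/>
      </Directory>
    </Directory>
  </Directory>
  <Directory id="case_12_some_description">
    <Directory id="english">
      <Directory id="images">
        <file id="screenshot_1.png"/>
      </Directory>
    </Directory>
  </Directory>
</Directory>"""


def iter_paths(node, prefix=()):
    """Yield the path of every <file> at or below node, depth first."""
    # Assumption: every element, directory or file, stores its name in 'id'.
    here = prefix + (node.get('id'),)
    if node.tag == 'file':
        yield '/'.join(here)
    else:
        # Anything that is not a file is treated as a directory and descended into.
        for child in node:
            yield from iter_paths(child, here)


directory_paths = list(iter_paths(ET.fromstring(SAMPLE)))
for path in directory_paths:
    print(path)
```

Since the recursion only distinguishes `file` from everything else, the nesting depth can vary from branch to branch.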