I have a file like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="MSPLAB" audio_filename="Combine001" version="5" version_date="110525">
<Episode>
<Section type="report" startTime="0" endTime="2613.577">
<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>
</Section>
</Episode>
</Trans>
Is this considered a well-formed XML file? I am trying to extract TARGET_TEXT1 and TARGET_TEXT2 in Python, but I don't understand which element this content belongs to, since it sits between self-closing tags. I saw another post about this here, but it is done in Java.
You can use itertext from ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
data = [text.strip() for node in root.findall('.//Turn') for text in node.itertext() if text.strip()]
print(data)
Output:
['TARGET_TEXT1', 'TARGET_TEXT2']
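As for where the text belongs: ElementTree stores text that follows an element's end tag in that element's tail attribute, so each TARGET_TEXT line is the tail of the Sync element just before it. The sketch below shows this on a trimmed copy of the Turn element from the question; the names xml_snippet and parts are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Turn element from the question (illustrative only).
xml_snippet = """<Turn startTime="0" endTime="308.0620625">
<Sync time="0"/>
<Event desc="music" type="noise" extent="instantaneous"/>
<Sync time="2.746"/>
TARGET_TEXT1
<Sync time="5.982"/>
TARGET_TEXT2
</Turn>"""

turn = ET.fromstring(xml_snippet)

# Text before the first child lives in turn.text; text after each child's
# end tag lives in that child's .tail (which may be None or whitespace).
parts = [
    (child.tag, child.get("time"), (child.tail or "").strip())
    for child in turn
]

for part in parts:
    print(part)
```

Only the two Sync elements that are followed by text end up with a non-empty tail; the first Sync and the Event have whitespace-only tails.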
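Update: If you want a dictionary as output, try this:
# Only Sync elements carry a time attribute; a tail can be None, so check it before stripping.
data = {float(x.attrib['time']): x.tail.strip() for node in root.findall('.//Turn') for x in node.findall('Sync') if x.tail and x.tail.strip()}
# {2.746: 'TARGET_TEXT1', 5.982: 'TARGET_TEXT2'}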
An alternative, using XPath via parsel:
from parsel import Selector
# the XML string is stored in a variable called data
selector = Selector(text=data, type="xml")
# raw string avoids an invalid-escape warning for \w
selector.xpath(".//Turn/text()").re(r"\w+")
['TARGET_TEXT1', 'TARGET_TEXT2']
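On the well-formedness part of the question: text mixed in between child elements is allowed in XML, and the file shown parses without error, so it is well-formed. A rough way to check this yourself is to see whether the parser raises ParseError. The helper name is_well_formed below is hypothetical, and this only tests well-formedness; it does not validate against trans-14.dtd.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Mixed content: text after a self-closing tag is fine.
good = '<Turn><Sync time="2.746"/>TARGET_TEXT1</Turn>'
# Sync is opened but never closed, so the tags do not nest properly.
bad = '<Turn><Sync time="2.746">TARGET_TEXT1</Turn>'

print(is_well_formed(good))
print(is_well_formed(bad))
```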