
Extract information from XML

I'm using the etree module and trying to extract the text inside the <text ...> tag. Here is my XML file. If the content of <text ...> starts with {{Infobox film, I want to copy all the text between {{ and }}. Is this possible? Thanks.

Update: XML file updated

The following snippet should do what you want:

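import re
from xml.etree import ElementTree

# Parse the MediaWiki export file
with open('films.xml') as f:
    xml = ElementTree.parse(f)

# <text> elements live in the MediaWiki export namespace
for t in xml.findall('.//{http://www.mediawiki.org/xml/export-0.5/}text'):
    print('====================')
    # t.text is None for an empty <text/> element, so fall back to ''
    m = re.search(r'(?s).*?{{(Infobox film.*?)}}', t.text or '')
    if m:
        print(m.group(1))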

The regular expression there begins with (?s), which turns on the DOTALL option, meaning that . matches newlines as well as any other character. The two instances of .*? are non-greedy matches of any character, i.e. they will find the shortest stretch of zero or more characters until the rest of the expression can be matched.
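One consequence of the non-greedy match is that it stops at the first }} it meets, so an infobox that itself contains a nested template such as {{duration|m=90}} would be cut short. The following is only a rough sketch of one way around that, counting braces instead of relying on the regex alone. The inline XML sample and the helper name extract_template are made up for illustration and are not from the question's file, and the brace counting ignores wiki edge cases such as {{{parameter}}} syntax.

```python
import re
from xml.etree import ElementTree

NS = "{http://www.mediawiki.org/xml/export-0.5/}"

# Hypothetical miniature dump, invented for this example
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page><revision><text>{{Infobox film
| name = Example
| runtime = {{duration|m=90}}
}}
Article body.</text></revision></page>
</mediawiki>"""


def extract_template(text, name="Infobox film"):
    """Return the body of the first {{name ...}} template, keeping nested {{ }} pairs."""
    start = text.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Drop the outer {{ and }} from the result
                return text[start + 2:i - 2]
        else:
            i += 1
    return None  # braces never balanced


root = ElementTree.fromstring(SAMPLE_XML)
text = root.find(".//" + NS + "text").text or ""

# The regex from the answer: stops at the first }} (inside the nested template)
short = re.search(r"(?s).*?{{(Infobox film.*?)}}", text).group(1)

# The brace-counting sketch: runs to the }} that closes the infobox
full = extract_template(text)

print(short)
print("--------------------")
print(full)
```

If the infoboxes in your dump never contain nested templates, the plain regex is enough and this extra helper is unnecessary.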
