I'm using the etree module and trying to extract information from inside the <text ...> tag. Here is my XML file. If the content of a <text ...> tag starts with {{Infobox film, I want to copy all the text between {{ and }}. Is this possible? Thanks.
Update: XML file updated
The following snippet should do what you want:
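import re
from xml.etree import ElementTree

with open('films.xml') as f:
    xml = ElementTree.parse(f)

# The namespace URI must match the one declared in the export file.
for t in xml.findall('.//{http://www.mediawiki.org/xml/export-0.5/}text'):
    print('====================')
    # t.text is None for an empty <text/> element, so fall back to ''.
    m = re.search(r'(?s).*?{{(Infobox film.*?)}}', t.text or '')
    if m:
        print(m.group(1))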
The regular expression begins with (?s), which turns on the DOTALL option, meaning that . matches newlines as well as any other character. The two instances of .*? are non-greedy matches of any character, i.e. they match the shortest stretch of zero or more characters that still lets the rest of the expression match.
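To see what DOTALL and the non-greedy quantifier each contribute, here is a small sketch run against a made-up string. The sample wikitext and the variable names are invented for illustration and are not taken from the XML file in the question.

```python
import re

# Hypothetical sample wikitext, invented for illustration only.
sample = (
    "Some intro text.\n"
    "{{Infobox film\n"
    "| name = Example\n"
    "| director = Someone\n"
    "}}\n"
    "More text."
)

pattern = r'{{(Infobox film.*?)}}'

# Without DOTALL, "." does not match "\n", so the multi-line template is missed.
no_dotall = re.search(pattern, sample)

# With DOTALL, "." also matches newlines and the whole template body is captured.
with_dotall = re.search(pattern, sample, re.DOTALL)

# Append a second template to compare non-greedy and greedy matching.
two_templates = sample + "\n{{Other template}}"
lazy = re.search(r'{{(Infobox film.*?)}}', two_templates, re.DOTALL).group(1)
greedy = re.search(r'{{(Infobox film.*)}}', two_templates, re.DOTALL).group(1)

print(no_dotall)             # no match object
print(with_dotall.group(1))  # the infobox body only
print(greedy)                # runs on to the last "}}" in the string
```

Because the non-greedy match stops at the first }}, an infobox that itself contains nested {{...}} templates would be cut short at the end of the first nested template. Handling that case would need brace counting rather than a single regular expression.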