Read XML with multiple top-level items using Python ElementTree?
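How can I read an XML file using Python ElementTree if the XML has multiple top-level items?
I have an XML file that I would like to read using Python ElementTree.
Unfortunately, it has multiple top-level tags. I would wrap <doc>...</doc> around the XML, except that I have to put the <doc> after the <?xml> and <!DOCTYPE> fields, and figuring out where the <!DOCTYPE> ends is non-trivial.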
What I have:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE FOO BAR "foo.dtd" [
<!ENTITY ...>
<!ENTITY ...>
<!ENTITY ...>
]>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
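For reference, a minimal sketch of why input shaped like this cannot be parsed as-is, assuming Python 3 and the standard-library xml.etree.ElementTree. The fragment below is a made-up stand-in for the real file: an XML document may have only one root element, so the parser rejects the second top-level tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: two top-level elements, no single root.
fragment = "<ARTICLE>one</ARTICLE><ARTICLE>two</ARTICLE>"

try:
    ET.fromstring(fragment)
    parsed = True
except ET.ParseError as exc:
    # The parser stops at the second top-level element.
    parsed = False
    print("ParseError:", exc)
```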
What I want:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE FOO BAR "foo.dtd" [
<!ENTITY ...>
<!ENTITY ...>
<!ENTITY ...>
]>
<DOC>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
<ARTICLE> ... </ARTICLE>
</DOC>
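Note that the name of the ARTICLE tag might change, so I cannot simply search for it.
Can anyone suggest how I can add the enclosing <doc>...</doc> after the XML header, or suggest another workaround?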
I wrote the following function to add a top-level tag after the XML processing instructions. You can now find this code in my common Python library as common.myelementtree.add_toplevel_tag
import re

# Matches optional whitespace followed by the start of a processing
# instruction or declaration ("<?" or "<!").
xmlprocre = re.compile(r"(\s*<[?!])")

def add_toplevel_tag(string):
    """
    After all the XML processing instructions, add an enclosing
    top-level <DOC> tag, and return the result.

    e.g.
    <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE FOO BAR "foo.dtd" [ <!ENTITY ...> <!ENTITY ...> <!ENTITY ...> ]> <ARTICLE> ...
    </ARTICLE> <ARTICLE> ... </ARTICLE> <ARTICLE> ... </ARTICLE> <ARTICLE> ... </ARTICLE>
    =>
    <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE FOO BAR "foo.dtd" [ <!ENTITY ...> <!ENTITY ...> <!ENTITY ...> ]><DOC> <ARTICLE> ...
    </ARTICLE> <ARTICLE> ... </ARTICLE> <ARTICLE> ... </ARTICLE> <ARTICLE> ... </ARTICLE></DOC>
    """
    def _advance_proc(string, idx):
        # If possible, advance over whitespace and one processing
        # instruction starting at string index idx, and return the
        # index just past it. If not possible, return None.
        # Find the beginning of the processing instruction
        m = xmlprocre.match(string[idx:])
        if m is None:
            return None
        idx = idx + len(m.group(1))
        # Find the closing > bracket, counting nested < ... > pairs
        bracketdebt = 1
        while bracketdebt > 0:
            if string[idx] == "<":
                bracketdebt += 1
            elif string[idx] == ">":
                bracketdebt -= 1
            idx += 1
        return idx

    loc = 0
    while True:
        # Advance one processing instruction
        newloc = _advance_proc(string, loc)
        if newloc is None:
            break
        loc = newloc
    return string[:loc] + "<DOC>" + string[loc:] + "</DOC>"
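A usage sketch, assuming Python 3 and the standard-library xml.etree.ElementTree. The helper wrap_in_root below is a hypothetical, condensed stand-in for the same bracket-counting idea (it is not the library code mentioned above); the root parameter, the sample input, and the explicit error on unterminated input are all assumptions added for illustration. Like the function above, it would miscount if a "<" or ">" appeared inside a quoted entity value or a comment in the prolog.

```python
import re
import xml.etree.ElementTree as ET

# Optional whitespace, then the start of a prolog item: "<?" or "<!".
_PROLOG_START = re.compile(r"\s*<[?!]")


def wrap_in_root(text, root="DOC"):
    """Hypothetical helper: insert <root>...</root> after the XML prolog."""
    pos = 0
    while True:
        m = _PROLOG_START.match(text, pos)
        if m is None:
            break
        # Count nested angle brackets so that a DOCTYPE internal subset
        # containing <!ENTITY ...> declarations is skipped as one unit.
        depth, i = 1, m.end()
        while depth > 0:
            if i >= len(text):
                raise ValueError("unterminated prolog item")
            if text[i] == "<":
                depth += 1
            elif text[i] == ">":
                depth -= 1
            i += 1
        pos = i
    return f"{text[:pos]}<{root}>{text[pos:]}</{root}>"


# Made-up sample input with a prolog and two top-level elements.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE DOC [\n"
    '<!ENTITY foo "bar">\n'
    "]>\n"
    "<ARTICLE>one</ARTICLE>\n"
    "<ARTICLE>two</ARTICLE>\n"
)

wrapped = wrap_in_root(sample)
tree = ET.fromstring(wrapped)
print(tree.tag, [child.text for child in tree])  # prints: DOC ['one', 'two']
```

Once the text is wrapped, the individual top-level items are simply the children of the new root, whatever their tag names are.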