Parsing large XML data using Python's ElementTree
I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.
My code is right below, and a bit of the XML data is after my code.
import xml.etree.ElementTree as ET
tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()
for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'): # all venue tags with id attribute
        print 'journal'
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>
with open("C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())
Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.
Also, in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit
root = tree.getroot()
Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
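The three points above can be checked with a minimal sketch. It is written for Python 3 (the thread's code is Python 2), and the one-article XML_TEXT string and the temporary file are assumptions made only for illustration, not the asker's data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Illustrative content only; a cut-down stand-in for the posted snippet.
XML_TEXT = "<dblp><article><author>E. F. Codd</author></article></dblp>"

# fromstring takes the XML content itself and returns the root Element,
# so no getroot() call is needed.
root_from_string = ET.fromstring(XML_TEXT)

# parse takes a filename (or file object) and returns an ElementTree,
# so getroot() is needed to reach the root Element.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(XML_TEXT)
    tmp_path = tmp.name
try:
    tree = ET.parse(tmp_path)
    root_from_file = tree.getroot()
finally:
    os.remove(tmp_path)

print(root_from_string.tag, root_from_file.tag)

# Dropping the closing </dblp> makes the document unparsable.
try:
    ET.fromstring(XML_TEXT[:-len("</dblp>")])
    truncated_ok = True
except ET.ParseError as err:
    truncated_ok = False
    print("ParseError:", err)
```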
The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.
Try:
import xml.etree.ElementTree as ET
import htmlentitydefs
filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print 'Author name: {}'.format(author.text)
        for journal in elem.findall('journal'): # all venue tags with id attribute
            print(journal.text)
        elem.clear()
Note: To use iterparse your XML must be well-formed, which means among other things that there cannot be empty lines at the beginning of the file, before the XML declaration.
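To illustrate the note about leading empty lines, here is a small Python 3 sketch. The DOC string is made up for the demonstration, and ET.fromstring stands in for iterparse because both go through the same parser; stripping the whitespace first is the same fix used in the earlier answer.

```python
import xml.etree.ElementTree as ET

# Illustrative document: two blank lines precede the XML declaration.
DOC = '\n\n<?xml version="1.0"?>\n<dblp><article/></dblp>\n'

# The XML declaration must be the very first thing in the document,
# so parsing the text as-is fails.
try:
    ET.fromstring(DOC)
    leading_blank_ok = True
except ET.ParseError as err:
    leading_blank_ok = False
    print("ParseError:", err)

# Stripping the surrounding whitespace makes the same text parsable.
root = ET.fromstring(DOC.strip())
print(root.tag)
```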
You are using .fromstring() instead of .parse():
import xml.etree.ElementTree as ET
tree = ET.parse("C:\pbc.xml")
root = tree.getroot()
.fromstring() expects to be given the XML data in a bytestring, not a filename.
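A quick way to see why the original call fails is sketched below for Python 3. The path value is a hypothetical filename and no file is opened: .fromstring() tries to parse the path text itself as XML and raises ParseError. The exact error message may differ from the one in the question.

```python
import xml.etree.ElementTree as ET

path = "pbc.xml"  # hypothetical filename; it is never opened here

# The characters of the path are treated as the document, which is not XML.
try:
    ET.fromstring(path)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("ParseError:", err)
```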
If the document is really large (many megabytes or more) then you should use the ET.iterparse() function instead and clear the elements you have processed:
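for event, article in ET.iterparse('C:\\pbc.xml'):
    if article.tag != 'article':
        continue  # the stdlib iterparse has no tag argument, so filter here
    for title in article.findall('title'):
        print 'Title: {}'.format(title.text)
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):
        print 'Journal: {}'.format(journal.text)
    article.clear()  # free the memory held by the processed element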
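For readers on Python 3, the same iterparse-and-clear pattern might look like the sketch below. The SAMPLE bytes are a shortened copy of the data from the question, held in memory through io.BytesIO so the example runs without a file, and iter_articles is a hypothetical helper name, not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the question's data, kept in memory for the demo.
SAMPLE = b"""<dblp>
  <article key="persons/Codd71a">
    <author>E. F. Codd</author>
    <title>Further Normalization of the Data Base Relational Model.</title>
  </article>
  <article key="persons/Hall74">
    <author>Patrick A. V. Hall</author>
    <title>Common Subexpression Identification in General Algebraic Systems.</title>
  </article>
</dblp>"""


def iter_articles(source):
    """Yield (title, authors) for each <article>, clearing it afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue  # the stdlib iterparse has no tag filter
        title = elem.findtext("title")
        authors = [a.text for a in elem.findall("author")]
        yield title, authors
        elem.clear()  # drop the children and text of the processed article


records = list(iter_articles(io.BytesIO(SAMPLE)))
for title, authors in records:
    print("{} by {}".format(title, ", ".join(authors)))
```

Passing a real filename instead of the BytesIO object should work the same way. Note that elem.clear() empties each article but the root still keeps a reference to the emptied element, so very large files may need additional cleanup of the root.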
You'd better not pass the meta-information of the XML file to the parser. The parser does well if the tags are properly closed, but the <?xml declaration may not be recognized by the parser. So omit the first two lines and try again. :-)