Parsing broken XML with lxml.etree.iterparse

I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?

import lxml.etree

# This works, but loads the whole file into memory.
parser = lxml.etree.XMLParser(recover=True)  # recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

# How do I do the equivalent with iterparse? (Using iterparse so the file can be streamed lazily from disk.)
context = lxml.etree.iterparse(filename, tag='RECORD')
# RECORD contains 6 elements that I need to extract the text from.

Thanks for your help!

EDIT -- Here is an example of the types of encoding errors I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

As you can see, chardet thinks it is an ASCII file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception.

Edit:

This is an older answer and I would have done it differently today. And I'm not just referring to the dumb snark ... since then BeautifulSoup4 has become available and it's really quite nice. I recommend that to anyone who stumbles over here.


The currently accepted answer is, well, not what one should do. The question itself also has a bad assumption:

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually recover=True is for recovering from malformed XML. There is however an "encoding" option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8',  # Your encoding issue.
                              recover=True,  # I assume you probably still want to recover from bad XML, it's quite nice. If not, remove.
                              )

That's it, that's the solution.


BTW, this is for anyone struggling with parsing XML in Python, especially from third-party sources. I know, I know, the documentation is bad and there are a lot of SO red herrings; a lot of bad advice.

  • lxml.etree.fromstring()? - That's for perfectly formed XML, silly
  • BeautifulStoneSoup? - Slow, and has a way-stupid policy for self-closing tags
  • lxml.etree.HTMLParser()? - (because the XML is broken) Here's a secret - HTMLParser() is... a Parser with recover=True
  • lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
  • UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond XML, but things are usually fixed if you do the right thing with XMLParser()

You could be trying all sorts of stuff from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it:

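from StringIO import StringIO  # Python 2, as in the original; on Python 3 use io.BytesIO with bytes
from lxml import etree

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object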

You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.

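# pseudocode, tidied so that it runs
import lxml.etree

class myFile(object):
    def __init__(self, filename):
        self.f = open(filename, 'rb')

    def read(self, size=None):
        # Return one cleaned line per call; readline() gives b'' at end of file,
        # which tells the parser to stop. Chain more .replace() calls as needed.
        return self.f.readline().replace(b'\x1e', b'')


# iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')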

I had to edit the myFile class a few times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked as well (it seems to support the recover option), but this solution worked like a charm!
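
The same idea can be sketched without listing each bad character by hand. This is only an illustration under stated assumptions: the class name CleanedFile is hypothetical, the input is assumed to be ASCII or UTF-8 (so bytes below 0x20 never occur inside a multi-byte sequence), and the standard library's xml.etree.ElementTree.iterparse stands in for lxml.etree.iterparse so the example runs without third-party packages. Both accept a file-like object with a read() method.

```python
import io
import re
import xml.etree.ElementTree as ET

# C0 control bytes other than tab, LF and CR; XML 1.0 rejects all of them.
_BAD_BYTES = re.compile(b'[\x00-\x08\x0b\x0c\x0e-\x1f]')


class CleanedFile(object):
    """File-like wrapper that strips invalid control bytes chunk by chunk."""

    def __init__(self, raw):
        self.raw = raw  # any binary file object, e.g. open(path, 'rb')

    def read(self, size=-1):
        while True:
            chunk = self.raw.read(size)
            if not chunk:
                return chunk  # real end of file
            cleaned = _BAD_BYTES.sub(b'', chunk)
            if cleaned:
                return cleaned
            # The chunk held only bad bytes; read again rather than
            # returning b'', which the parser would treat as end of file.


# Small in-memory stand-in for the big file on disk.
source = io.BytesIO(b'<root><RECORD>t2\x1et3</RECORD><RECORD>t4</RECORD></root>')

texts = []
for event, elem in ET.iterparse(CleanedFile(source), events=('end',)):
    if elem.tag == 'RECORD':
        texts.append(elem.text)
        elem.clear()  # drop the element's children and text to keep memory flat
```

Because the filter works on raw chunks, it does not depend on where line breaks fall in the input.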

Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that "bad unicode" is the problem.

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])

Update: You wrote """As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""

I see no "bad unicode" here.

chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".

The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)
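
As a rough sketch of what removing such characters could look like: the helper name strip_invalid_xml_chars is hypothetical, and the pattern covers only the C0 control range discussed here. XML 1.0 also excludes surrogates and U+FFFE/U+FFFF, which this sketch does not handle.

```python
import re

# Every C0 control character except tab (\x09), LF (\x0a) and CR (\x0d).
_INVALID_XML_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_invalid_xml_chars(text):
    """Return text with the C0 control characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub(u'', text)


sample = u'<p>t2\x1et3</p>'
cleaned = strip_invalid_xml_chars(sample)
```

Whether deleting is the right call depends on the data; check a sample of the offending characters first.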

Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove suchlike characters from your dump (after checking that they are all vanishable), or you may wish to choose a less picky and less voluminous output format than XML.

Update 2: You presumably want to use iterparse() not because it's your end goal but because you want to save memory. If you used a format like CSV you wouldn't have a memory problem.

Update 3: In response to a comment by @Purrell:

try it yourself, dude. pastie.org/3280965

Here's the contents of that pastie; it deserves preservation:

from lxml.etree import etree

data = '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)

To get it to run, one import needs to be fixed, and another supplied. The data is monstrous. There is no output to show the result. Here's a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding &lt; and &gt;) that are all valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.

[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article>&lt;p&gt;t1&lt;/p&gt;&lt;p&gt;t2\x1et3&lt;/p&gt;&lt;p&gt;t4
&lt;/p&gt;&lt;p&gt;t5&lt;/p&gt;</article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'

Not what I'd call "recovery"; after the bad character, the < and > characters disappear.

The pastie was in response to my question "What gives you the idea that encoding='utf-8' will solve his problem?". This was triggered by the statement 'There is however an "encoding" option which would have fixed your issue.' But encoding='ascii' produces the same output. So does omitting the encoding arg. It's NOT an encoding problem. Case closed.
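
The point can be checked without lxml. A minimal sketch, assuming the standard library's expat-based xml.etree.ElementTree as a stand-in parser (its error message differs from libxml2's): the byte decodes identically as ASCII and as UTF-8, yet the parser still rejects it.

```python
import unicodedata
import xml.etree.ElementTree as ET

data = b'<article>t2\x1et3</article>'

# The bytes are valid in both encodings and decode to the same text.
decoded_ascii = data.decode('ascii')
decoded_utf8 = data.decode('utf-8')

# \x1e is an ordinary control character (Unicode category "Cc").
category = unicodedata.category(u'\x1e')

# The parser still refuses it: the character itself is not allowed in XML 1.0.
try:
    ET.fromstring(data)
    parsed = True
except ET.ParseError:
    parsed = False
```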
