
Parsing broken XML with lxml.etree.iterparse

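I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?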

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

Thanks for your help!

EDIT -- Here is an example of the types of encoding errors I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.from
lxml.etree.fromstring      lxml.etree.fromstringlist  

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

As you can see, chardet thinks that it is an ASCII file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception.

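Edit:

This is an older answer and I would have done it differently today. And I'm not just referring to the dumb snark... since then BeautifulSoup4 has become available and it's really quite nice. I recommend it to anyone who stumbles over here.


The currently accepted answer is, well, not what one should do. The question itself also has a bad assumption: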

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is however an "encoding" option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8', #Your encoding issue.
                              recover=True, #I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
                              )

That's it, that's the solution.


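BTW -- a note for anyone struggling with parsing XML in Python, especially XML from third-party sources. I know, I know, the documentation is bad and there are a lot of SO red herrings; a lot of bad advice.

  • lxml.etree.fromstring()? - That's for perfectly formed XML, silly
  • BeautifulStoneSoup? - Slow, and has a way-stupid policy for self-closing tags
  • lxml.etree.HTMLParser()? - (because the xml is broken) Here's a secret - HTMLParser() is... a parser with recover=True
  • lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
  • UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond xml, but things are usually fixed if you do the right thing with XMLParser()

You could try all sorts of stuff from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it: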

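from lxml import etree
from cStringIO import StringIO #Python 2; on Python 3 use io.BytesIO with bytes input

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser) #or pass in an open file object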

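You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.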

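#file-like wrapper that cleans each line before the parser sees it
import lxml.etree

class myFile(object):
    def __init__(self, filename):
        #binary mode, so the parser receives the raw bytes
        self.f = open(filename, 'rb')

    def read(self, size=None):
        #one cleaned line per call; readline() returns b'' at end of file,
        #which tells iterparse that the input is exhausted
        return self.f.readline().replace(b'\x1e', b'').replace(b'some other bad character...', b'')


#iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')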

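I had to edit the myFile class a couple of times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked too (it seems to support the recover option), but this solution worked like a charm!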
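For anyone who would rather not keep adding replace() calls by hand, here is a minimal sketch of the same file-like wrapper idea that strips every control character XML 1.0 forbids in one pass. It is not the code from the answer above: SanitizedReader and iter_record_texts are hypothetical names, the sketch assumes an ASCII-compatible encoding such as UTF-8, and it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in so that it runs without lxml (lxml.etree.iterparse also accepts file-like objects).

```python
import io
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Control characters below 0x20 that XML 1.0 forbids (all except \t, \n, \r).
_BAD_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


class SanitizedReader(object):
    """Hypothetical file-like wrapper: cleans each chunk before parsing."""

    def __init__(self, fileobj, chunk_size=64 * 1024):
        self._f = fileobj
        self._chunk_size = chunk_size

    def read(self, size=-1):
        # Fall back to a fixed chunk size so the file is never read whole.
        n = self._chunk_size if size is None or size < 0 else size
        while True:
            chunk = self._f.read(n)
            if not chunk:
                return b''  # real end of file
            cleaned = _BAD_BYTES.sub(b'', chunk)
            if cleaned:  # skip chunks that were nothing but bad bytes
                return cleaned


def iter_record_texts(fileobj, tag='RECORD'):
    """Yield the child texts of each <tag> element, one record at a time."""
    for _event, elem in ET.iterparse(SanitizedReader(fileobj), events=('end',)):
        if elem.tag == tag:
            yield [child.text for child in elem]
            elem.clear()  # drop the record's children to keep memory low


# Tiny made-up document with a \x1e inside one field.
sample = (b'<ROOT><RECORD><A>x\x1ey</A><B>z</B></RECORD>'
          b'<RECORD><A>q</A></RECORD></ROOT>')
records = list(iter_record_texts(io.BytesIO(sample)))
```

With a real file you would pass open('bigfile.xml', 'rb') instead of the BytesIO object. Filtering bytes is only safe here because bytes below 0x20 never occur inside a multi-byte UTF-8 sequence; a UTF-16 file would need a different approach.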

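Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that "bad unicode" is the problem.

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])

Update: You wrote """As you can see, chardet thinks that it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""

I see no "bad unicode" here.

chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".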
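The distinction can be checked in a few lines. This is only an illustrative sketch: the data is a made-up fragment, and it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs anywhere. The byte decodes as ASCII without complaint, yet the XML parser still rejects it.

```python
import xml.etree.ElementTree as ET

data = b'<article>t2\x1et3</article>'  # made-up fragment with one 0x1E byte

# Decoding succeeds: 0x1E lies inside the 7-bit ASCII range.
text = data.decode('ascii')
is_ascii = all(ord(ch) < 128 for ch in text)

# Parsing fails anyway, because XML 1.0 does not allow this control character.
try:
    ET.fromstring(data)
    parse_error = None
except ET.ParseError as exc:
    parse_error = exc
```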

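The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)

Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove such characters from your dump (after checking that they are all vanishable), or you may wish to choose a less picky and less voluminous output format than XML.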
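Before deleting anything, it may help to see which forbidden characters actually occur in the dump and how often. A minimal sketch, assuming an ASCII-compatible encoding; count_forbidden_bytes is a hypothetical helper name, and the file is read in chunks so that memory use stays flat:

```python
import collections
import io
import re

# Control characters below 0x20 that XML 1.0 forbids (all except \t, \n, \r).
_FORBIDDEN = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def count_forbidden_bytes(fileobj, chunk_size=1 << 20):
    """Count each forbidden control byte in a binary file object."""
    counts = collections.Counter()
    # iter() with a sentinel keeps reading chunks until read() returns b''.
    for chunk in iter(lambda: fileobj.read(chunk_size), b''):
        counts.update(_FORBIDDEN.findall(chunk))
    return counts


# Made-up sample: two record separators, one NUL, plus legal whitespace.
counts = count_forbidden_bytes(io.BytesIO(b'a\x1eb\x1e\x00\t\r\n'))
```

On a real dump you would pass open('dump.xml', 'rb') (hypothetical filename) and inspect the context around each reported byte before deciding that it is safe to drop.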

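Update 2: You presumably want to use iterparse() not because it's your ultimate goal but because you want to save memory. If you used a format like CSV, you wouldn't have a memory problem.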
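To illustrate the CSV suggestion: Python's csv module hands back one row at a time, and it does not object to a stray "\x1e" inside a field. The column names below are made up for the example, and the in-memory string stands in for a real dump file.

```python
import csv
import io

# Made-up two-column dump; the second field contains the troublesome \x1e.
dump = 'id,articletext\n1,sang along \x1e they enjoyed it\n2,second article\n'


def iter_rows(fileobj):
    """Yield one dict per CSV row without loading the whole file."""
    for row in csv.DictReader(fileobj):
        yield row


rows = list(iter_rows(io.StringIO(dump)))
```

With a file on disk, open it with newline='' as the csv documentation recommends and iterate over iter_rows(f) directly instead of building a list.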

Update 3: In response to a comment by @Purrell:

Try it yourself, dude. pastie.org/3280965

Here are the contents of the pastie; it deserves preserving:

from lxml.etree import etree

data = '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)

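To get it to run, one import needs to be fixed, and another supplied. The data is monstrous. There is no output to show the result. Here is a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding &lt; and &gt;) that are all valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.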

[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article>&lt;p&gt;t1&lt;/p&gt;&lt;p&gt;t2\x1et3&lt;/p&gt;&lt;p&gt;t4
&lt;/p&gt;&lt;p&gt;t5&lt;/p&gt;</article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'

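Not what I'd call "recovery"; after the bad character, the < and > characters disappear.

The pastie was in response to my question "What gives you the idea that encoding='utf-8' would solve his problem?". This was triggered by the statement 'There is however an "encoding" option which would have fixed your issue.' But encoding='ascii' produces the same output. So does omitting the encoding arg. It is NOT an encoding problem. Case closed.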

