Parsing an XML file with unknown tags

I am trying to parse an XML file. My problem is the same as in this question:

parsing an xml file for unknown elements using python ElementTree

I tried unutbu's solution from that question.

It works well, but only for lines that contain a single tag.

For example:

<some_root_name>
    <tag_x>bubbles</tag_x>
</some_root_name>

That parses fine. But if the input looks like this:

src = '''\
<review type="review"><link>http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178</link>
'''

it fails. I have many instances like this. I don't want to go beyond the standard library, because I will later run the code on a different machine (the production environment) and would have to install the extra libraries there, which gets messy.

Is there a way I can modify the original solution to handle this? Thanks.

The code from the link above:

import xml.sax as sax
import xml.sax.handler as saxhandler
import pprint

class TagParser(saxhandler.ContentHandler):
    # http://docs.python.org/library/xml.sax.handler.html#contenthandler-objects
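    def __init__(self):
        saxhandler.ContentHandler.__init__(self)
        self.tags = {}
        self.tag = None   # name of the most recently opened element
        self.data = None  # most recent chunk of character data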
    def startElement(self, name, attrs):
        self.tag = name
    def endElement(self, name):
        if self.tag:
            self.tags[self.tag] = self.data
            self.tag = None
            self.data = None
    def characters(self, content):
        self.data = content

parser = TagParser()
src = '''\
<some_root_name>
    <tag_x>bubbles</tag_x>
    <tag_y>car</tag_y>
    <tag...>42</tag...>
</some_root_name>'''
sax.parseString(src, parser)
pprint.pprint(parser.tags)

Exception trace:

File "extract_xml.py", line 59, in unittest
  sax.parseString(src, parser)
File "C:\Python27\lib\xml\sax\__init__.py", line 49, in parseString
  parser.parse(inpsrc)
File "C:\Python27\lib\xml\sax\expatreader.py", line 107, in parse
  xmlreader.IncrementalParser.parse(self, source)
File "C:\Python27\lib\xml\sax\xmlreader.py", line 125, in parse
  self.close()
File "C:\Python27\lib\xml\sax\expatreader.py", line 217, in close
  self.feed("", isFinal = 1)
File "C:\Python27\lib\xml\sax\expatreader.py", line 211, in feed
  self._err_handler.fatalError(exc)
File "C:\Python27\lib\xml\sax\handler.py", line 38, in fatalError
  raise exception
xml.sax._exceptions.SAXParseException: <unknown>:2:4: no element found

The TagParser uses endElement to add data to self.tags.

With src equal to

src = '''\
<review type="review"><link>http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178</link>
'''

The <review> element has no closing tag, </review>, so endElement is never called for it, and the parser reports "no element found" when the input ends.

If you add a closing </review> tag to src:

src = '''\
<review type="review"><link>http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178</link></review>
'''

then the program yields

{u'link': u'http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178'}
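To see which callbacks actually fire in each case, here is a minimal sketch that logs start and end events and catches the parse error instead of letting it propagate. EventLogger, log_events and the example.invalid URL are illustrative names, not part of the original solution, and the sketch assumes Python 3.

```python
import xml.sax
import xml.sax.handler


class EventLogger(xml.sax.handler.ContentHandler):
    """Records every start/end callback the parser delivers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def log_events(text):
    """Return (events, error); error is None when parsing succeeds."""
    handler = EventLogger()
    error = None
    try:
        xml.sax.parseString(text.encode("utf-8"), handler)
    except xml.sax.SAXParseException as exc:
        error = exc
    return handler.events, error


# Hypothetical URL; only the shape of the markup matters here.
unclosed = '<review type="review"><link>http://example.invalid/r</link>\n'
closed = '<review type="review"><link>http://example.invalid/r</link></review>\n'

for label, text in (("unclosed", unclosed), ("closed", closed)):
    events, error = log_events(text)
    print(label, events, "error:", error)
```

In the unclosed case the log ends with the end event for link and never contains one for review. The inner element is reported before the parser gives up at the end of the input.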

This actually works just fine, despite what your question says:

parser = TagParser()
src = '''\
<some_root_name>
    <tag_x>bubbles</tag_x>
    <tag_y>car</tag_y>
    <tag...>42</tag...>
</some_root_name>'''
sax.parseString(src, parser)
pprint.pprint(parser.tags)

parser.tags ends up as:

{u'tag...': u'42', u'tag_x': u'bubbles', u'tag_y': u'car'}

Your other example does fail, but only because it's not well-formed XML:

src = '''<review type="review"><link>http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178</link>'''
parser = TagParser()
sax.parseString(src, parser)
pprint.pprint(parser.tags)

The review tag is never closed in your source, so this is not a well-formed XML fragment, and parsing it raises an exception.
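If you want to find out which of your inputs are affected before running the real handler over them, a small check along these lines may help. It is only a sketch: well_formedness_error is a hypothetical helper name, the sample strings are made up, and Python 3 is assumed.

```python
import xml.sax
import xml.sax.handler


def well_formedness_error(text):
    """Return None if text parses as XML, else a 'line:column: message' string."""
    try:
        # A bare ContentHandler ignores all content; we only care about errors.
        xml.sax.parseString(text.encode("utf-8"), xml.sax.handler.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "%s:%s: %s" % (
            exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage())
    return None


# Made-up samples: the first closes its root element, the second does not.
samples = [
    "<review><link>a</link></review>",
    "<review><link>a</link>",
]
for sample in samples:
    print(repr(sample), "->", well_formedness_error(sample))
```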

If your problem is that you're taking incomplete fragments out of a valid document, don't do that; take the entire review tag and parse it, rather than trying to parse a single line out of it.
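As a rough illustration of parsing whole review elements rather than single lines, the sketch below uses ElementTree from the standard library. The reviews wrapper, the rating element and the example.invalid URLs are invented for the example; your real document will have its own structure.

```python
import xml.etree.ElementTree as ET

# Invented document: each <review> spans several lines, so no single line
# is a complete XML fragment, but the document as a whole is.
document = '''\
<reviews>
<review type="review"><link>http://example.invalid/a</link>
<rating>4</rating>
</review>
<review type="review"><link>http://example.invalid/b</link>
<rating>5</rating>
</review>
</reviews>'''


def reviews_as_dicts(xml_text):
    """Return one {child tag: text} dict per <review> element."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in review}
        for review in root.iter("review")
    ]


for entry in reviews_as_dicts(document):
    print(entry)
```

The child tag names are read from the data, so this still works when you do not know them in advance.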

If your problem is that the source data is actually not valid XML, you need to use a parser designed to handle broken XML, like BeautifulSoup; neither ElementTree nor xml.sax is going to work.
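If installing BeautifulSoup on the production machine is not an option, one possible stand-in is html.parser.HTMLParser from the Python 3 standard library, which does not insist on closed tags. This is a different technique from the one recommended above and is my assumption, not something either answer proposes. It is an HTML tokenizer rather than an XML parser, so it lowercases tag names and knows nothing about namespaces. LenientTagCollector is a hypothetical name.

```python
from html.parser import HTMLParser


class LenientTagCollector(HTMLParser):
    """Maps each leaf tag name to its text, tolerating unclosed outer tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = {}
        self._open = None   # innermost tag opened since the last end tag
        self._chunks = []   # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        self._open = tag
        self._chunks = []

    def handle_data(self, data):
        if self._open is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Only record text when the end tag closes the tag we just opened.
        if tag == self._open:
            self.tags[tag] = "".join(self._chunks)
        self._open = None


src = '''\
<review type="review"><link>http://www.openlist.com/new-york-ny/mickey-mantles/27612417/?numReviews=178</link>
'''
collector = LenientTagCollector()
collector.feed(src)
collector.close()
print(collector.tags)
```

The missing </review> raises no error here. The flip side is that this approach will not warn you when the input really is damaged, so treat it as a fallback.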
