
How do I parse an XML comment properly in Python?

I have been using Python recently and I want to extract information from a given XML file. The problem is that the information is stored really badly, in a format like this:

<Content>
   <tags>
   ....
   </tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>

I cannot post the entire data here, since it is about 20,000 lines. I just want to get the list ["string1", "string2", ...], and this is the code I have been using so far:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
    print (node.text)

However, my output is None. How can I get the comment data? (Again, I am using Python.)

The problem is that what you have is not a standard XML comment (the <![CDATA[ ... ]]> construct is a CDATA section). A standard comment looks like this: <!--Comment here-->.

Comments of this kind can be parsed with BeautifulSoup, for example:

from bs4 import BeautifulSoup, Comment

xml = """<Content>
   <tags>
   ...
   </tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""
soup = BeautifulSoup(xml, "html.parser")  # name the parser explicitly to avoid a warning
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)

This returns ['[CDATA["string1"; "string2"; ....]]'], from which it is easy to parse further into the required strings.

If you have non-standard comments, I would recommend a regular expression like:

import re
xml = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""
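# re.DOTALL lets the section span several lines; the non-greedy .*?
# stops at the first closing ]]> instead of the last one in the file.
for i in re.findall(r"<!\[CDATA\[.*?\]\]>", xml, re.DOTALL):
    # "[^"]*" matches one quoted string at a time
    for j in re.findall(r'"[^"]*"', i):
        print(j)

This prints "string1" and "string2", one per line.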
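
As an alternative to the regular expression over the raw file, the same strings can probably be reached through ElementTree itself. A CDATA section is parsed as ordinary character data, and because it follows the closing </tags> tag, ElementTree stores it in the tail attribute of the tags element rather than in the text of Content. The following is a minimal sketch under that assumption; the sample document and the variable names are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample shaped like the snippet in the question.
xml = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2"]]>
</Content>"""

root = ET.fromstring(xml)

strings = []
for content in root.iter("Content"):
    for child in content:
        # Text that follows a child's closing tag lives in child.tail.
        if child.tail:
            strings.extend(re.findall(r'"([^"]*)"', child.tail))

print(strings)
```

With a real file, ET.parse(xmlfile).getroot() would take the place of ET.fromstring(xml).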

You'll want to create a SAX-based parser instead of a DOM-based parser, especially with a document as large as yours.

A SAX-based parser requires you to write your own control logic for how data is stored. It's more complicated than simply loading the document into a DOM, but it needs far less memory and is often faster, because it streams through the document and reports events as it goes instead of loading the entire document at once. That also gives it the advantage that it can deal with squirrelly cases like yours, with comments.

When you build your handler, you'll probably want to use the LexicalHandler in your parser to pull out those comments.

I'd give you a working example of how to build one, but it's been a long time since I've done it myself. There are plenty of guides online on how to build a SAX-based parser, so I will defer that discussion to another thread.
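
For readers who want a starting point, here is a rough sketch of that approach with Python's standard xml.sax module. It is not the answerer's own code. The class name CDataCollector and the sample document are invented for illustration, and the sketch assumes the default expat-based reader, which accepts a lexical handler through the property_lexical_handler property. The lexical handler is told where a CDATA section starts and ends, while the text itself arrives through the ordinary characters() event.

```python
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class CDataCollector(ContentHandler):
    """Hypothetical handler that collects the text inside CDATA sections."""

    def __init__(self):
        super().__init__()
        self.in_cdata = False
        self.chunks = []
        self.sections = []

    # ContentHandler event: may fire several times for one section.
    def characters(self, content):
        if self.in_cdata:
            self.chunks.append(content)

    # Lexical events: mark where a CDATA section starts and ends.
    def startCDATA(self):
        self.in_cdata = True
        self.chunks = []

    def endCDATA(self):
        self.in_cdata = False
        self.sections.append("".join(self.chunks))

    # Remaining lexical events, unused in this sketch.
    def comment(self, text):
        pass

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


# Made-up sample shaped like the snippet in the question.
sample = b'<Content><tags>asd</tags><![CDATA["string1"; "string2"]]></Content>'

handler = CDataCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setProperty(property_lexical_handler, handler)
parser.parse(io.BytesIO(sample))

strings = [s for section in handler.sections
           for s in re.findall(r'"([^"]*)"', section)]
print(strings)
```

For a file on disk, parser.parse() also accepts a file name, so the document is streamed rather than held in memory.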

Since Python 3.8, ElementTree can keep comments in the parsed tree.

Sample code to read the tag, attributes, value and preceding comment of each element in an XML file:

import xml.etree.ElementTree as ET

infile_path = "input.xml"  # placeholder: path to your XML file

# insert_comments=True (Python 3.8+) keeps comment nodes in the tree
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse(infile_path, parser)

COMMENT = ""

for node in tree.iter():
    if node.tag is ET.Comment:
        # Comment nodes use the ET.Comment function as their tag
        COMMENT = node.text
    else:
        TAG = node.tag                  # tag name (string)
        NAME = node.attrib.get("name")  # "name" attribute, or None if absent
        VALUE = node.text               # element text

        # COMMENT holds the most recent comment seen before this element
        print(TAG, NAME, VALUE, COMMENT)


 