
Get the text of an XML node including its child nodes (or something like this)

I need to get the plain text out of an XML node and its child nodes (or whatever these strange inner tags are called):

Example-Nodes:

<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
 = 63 - 100
</BookTitle>

or:

<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
 = 74 - 210
</BookTitle>

I have to get:

Z = 63 - 100
Mtn Z = 74 - 210

Remember, this is just an example! There could be any type of child node inside the BookTitle node, and all I need is the plain text inside BookTitle.

I tried:

tagtext = root.find('.//BookTitle').text
print tagtext

but .text can't deal with these strange XML nodes and gives me None ("NoneType") back.

Regards & Thanks!

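That text isn't stored in the .text of the BookTitle node; it's the .tail of the Emphasis node. In ElementTree, an element's .text only holds the text before its first child, and whatever follows a child's closing tag goes into that child's .tail. So you should do something like: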

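def parse(el):
    # el.text is the text before the first child; it is None if there is none
    text = (el.text or '').strip()
    text = text + ' ' if text else ''
    # iterate over the element directly; getchildren() was removed in Python 3.9
    for child in el:
        # child.tail is the text after the child's closing tag (may also be None)
        text += '{0} {1}\n'.format((child.text or '').strip(),
                                   (child.tail or '').strip())
    return text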

Which gives you:

>>> import xml.etree.ElementTree as et
>>> root = et.fromstring('''
    <BookTitle>
    <Emphasis Type="Italic">Z</Emphasis>
     = 63 - 100
    </BookTitle>''')
>>> print parse(root)
Z = 63 - 100

And for:

>>> root = et.fromstring('''
<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
 = 74 - 210
</BookTitle>''')
>>> print parse(root)
Mtn Z = 74 - 210

Which should give you a basic idea what to do.
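Note that parse above only looks one level deep. If the children can themselves contain children, as the question suggests, a sketch based on Element.itertext() from the standard library's ElementTree may be simpler. The helper name all_text and the whitespace normalisation are my own assumptions, not part of the original answer:

```python
import xml.etree.ElementTree as et

def all_text(el):
    # Hypothetical helper. itertext() yields el.text, then every
    # descendant's text and tail, in document order, at any depth.
    pieces = el.itertext()
    # split() with no arguments drops newlines and repeated spaces,
    # so re-joining with one space normalises the whitespace.
    return ' '.join(''.join(pieces).split())

root = et.fromstring('''<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
 = 74 - 210
</BookTitle>''')
print(all_text(root))  # Mtn Z = 74 - 210
```

The whitespace collapsing is a design choice that suits titles like these; if the exact spacing matters, return ''.join(el.itertext()) unchanged instead.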

Update: Fixed the whitespace...

You can use the minidom parser. Here is an example:

from xml.dom import minidom

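def strip_tags(node):
    text = ""
    for child in node.childNodes:
        # TEXT_NODE is available on every node, so the global 'doc' isn't needed
        if child.nodeType == child.TEXT_NODE:
            # .data is the unescaped text; toxml() would keep entities like &amp;
            text += child.data
        else:
            text += strip_tags(child)
    return text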

doc = minidom.parse("<your-xml-file>")

text = strip_tags(doc)

The recursive strip_tags function walks the XML tree depth-first and concatenates the text nodes in document order, so the tags are dropped and only the text is left.
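To try the same idea without a file on disk, here is a minimal sketch that feeds one of the question's examples to minidom.parseString. The helper name collect_text and the extra CDATA handling are my own assumptions, added for illustration:

```python
from xml.dom import minidom

def collect_text(node):
    # Hypothetical helper: gathers character data in document order.
    parts = []
    for child in node.childNodes:
        # Treat CDATA sections like ordinary text nodes.
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        else:
            # Elements are recursed into; comments and processing
            # instructions have no children, so they contribute nothing.
            parts.append(collect_text(child))
    return ''.join(parts)

doc = minidom.parseString(
    '<BookTitle><Emphasis Type="Italic">Z</Emphasis> = 63 - 100</BookTitle>')
print(collect_text(doc))  # Z = 63 - 100
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on large documents; for small titles like these either approach works.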
