I need to extract the plain text from an XML node and all of its child nodes (or whatever these strange inner tags are called).
Example nodes:
<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
= 63 - 100
</BookTitle>
or:
<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>
I have to get:
Z = 63 - 100
Mtn Z = 74 - 210
Remember, this is just an example! There could be any kind of child node inside the BookTitle node, and all I need is the plain text inside BookTitle.
I tried:
tagtext = root.find('.//BookTitle').text
print tagtext
but .text can't deal with these strange XML nodes and gives me None back.
Regards & Thanks!
That's not the text of the BookTitle node, it's the tail of the Emphasis node. So you should do something like:
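def parse(el):
    # .text and .tail are None when there is no text, so fall back to ''
    parts = [(el.text or '').strip()]
    # iterate over the element directly; getchildren() was removed in Python 3.9
    for child in el:
        parts.append((child.text or '').strip())
        parts.append((child.tail or '').strip())
    # join the non-empty pieces with single spaces
    return ' '.join(p for p in parts if p)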
Which gives you:
>>> import xml.etree.ElementTree as et
>>> root = et.fromstring('''
<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
= 63 - 100
</BookTitle>''')
>>> print(parse(root))
Z = 63 - 100
And for:
>>> root = et.fromstring('''
<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>''')
>>> print(parse(root))
Mtn Z = 74 - 210
Which should give you a basic idea what to do.
Update: Fixed the whitespace...
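Since the question says any kind of child node may appear inside BookTitle, at any depth, here is a minimal sketch that does not depend on the nesting. It assumes the standard-library xml.etree.ElementTree and uses Element.itertext(), which yields every text and tail inside the element in document order. The helper name all_text is made up for this example.

```python
import xml.etree.ElementTree as ET

def all_text(el):
    """Return all text inside el, at any depth, with whitespace collapsed."""
    # itertext() walks the element and its descendants in document order,
    # yielding each .text and .tail that falls inside el
    raw = "".join(el.itertext())
    # split() with no arguments drops newlines and repeated spaces
    return " ".join(raw.split())

first = ET.fromstring('''<BookTitle>
<Emphasis Type="Italic">Z</Emphasis>
= 63 - 100
</BookTitle>''')

second = ET.fromstring('''<BookTitle>
Mtn
<Emphasis Type="Italic">Z</Emphasis>
= 74 - 210
</BookTitle>''')

print(all_text(first))   # Z = 63 - 100
print(all_text(second))  # Mtn Z = 74 - 210
```

Collapsing the whitespace is a choice made for these examples. If the line breaks inside the title matter, return the joined string as it is.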
You can use the minidom parser. Here is an example:
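from xml.dom import minidom

def strip_tags(node):
    # Concatenate the text of every descendant node, in document order
    text = ""
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            # .data is the unescaped text; toxml() would return "&amp;" for "&"
            text += child.data
        else:
            text += strip_tags(child)
    return text

doc = minidom.parse("<your-xml-file>")
text = strip_tags(doc)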
The recursive strip_tags function walks the XML tree and concatenates the text nodes in document order.
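To try the minidom approach without a file, here is a small self-contained sketch that parses a string instead. The sample title and the helper name node_text are made up for illustration. It reads each text node's data attribute, so an escaped character such as &lt; comes back as a plain <, and it collapses whitespace at the end.

```python
from xml.dom import minidom

def node_text(node):
    """Collect the text of node and all of its descendants, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            # .data holds the unescaped character data
            parts.append(child.data)
        else:
            # elements are recursed into; comments have no children and add nothing
            parts.append(node_text(child))
    return "".join(parts)

# hypothetical sample input, modelled on the question's second example
doc = minidom.parseString(
    '<BookTitle>Mtn\n<Emphasis Type="Italic">Z</Emphasis>\n&lt; 210</BookTitle>'
)
title = doc.getElementsByTagName("BookTitle")[0]

# collapse newlines and repeated spaces into single spaces
result = " ".join(node_text(title).split())
print(result)  # Mtn Z < 210
```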