Extract all the text from XML data with Python

Question

I'm new to XML data processing. I want to extract the text data from the following XML file:

<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>

so that the expected result is: ['12345', '45667', 'abcde']. So far I have tried:

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
data = tree.getiterator()
text = [data[i].text for i in range(0, len(data))]

But the result is only ['12345', '45667']; 'abcde' is missing. Can someone help me? Thanks in advance!

Answer 1

Try this using XPath and lxml:

import lxml.etree as etree

string = '''
<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>
'''

tree = etree.fromstring(string)

print(tree.xpath('//p//text()'))

The XPath expression means: "select every text node that is a descendant of any p element, at any depth".

OUTPUT:

['12345', '45667', 'abcde']

Answer 2

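getiterator() (or its replacement, iter()) iterates over elements, and each element's .text holds only the text that comes before its first child. abcde is not the .text of any element: it is a text node stored as the tail of the strong element, i.e. the text that follows its closing tag. That is why it is missing from your result.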
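To make the text/tail distinction concrete, here is a minimal sketch using only the standard library. It parses the XML from a string instead of a file (an assumption made so the snippet is self-contained), and the variable names are just for illustration:

```python
import xml.etree.ElementTree as ET

# Same document as in the question, inlined so the snippet runs on its own.
xml_string = "<data><p>12345<strong>45667</strong>abcde</p></data>"
root = ET.fromstring(xml_string)

p = root.find("p")
strong = p.find("strong")

print(p.text)       # text before the first child of <p>
print(strong.text)  # text inside <strong>
print(strong.tail)  # text after </strong>, still inside <p>
print(p.tail)       # None: nothing follows </p> inside <data>
```

So a loop that reads only .text from each element never sees abcde; it has to be read from strong.tail.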

You can use the itertext() method, which yields both the text and the tail strings in document order:

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
print(list(tree.find('p').itertext()))

Prints:

['12345', '45667', 'abcde']
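If you call itertext() on the root rather than on p, the indentation between tags comes back as whitespace-only strings. A possible way to handle that, sketched below, is to strip each chunk and drop the empty ones. The helper name extract_text is hypothetical, and the XML is again inlined as a string rather than read from data.xml:

```python
import xml.etree.ElementTree as ET

# The question's document, including its indentation and line breaks.
XML_STRING = """
<data>
    <p>12345<strong>45667</strong>abcde</p>
</data>
""".strip()


def extract_text(element):
    """Return stripped, non-empty text chunks under element, in document order."""
    chunks = (chunk.strip() for chunk in element.itertext())
    return [chunk for chunk in chunks if chunk]


root = ET.fromstring(XML_STRING)
raw = list(root.itertext())   # includes whitespace-only strings
texts = extract_text(root)    # whitespace-only strings removed
print(texts)                  # ['12345', '45667', 'abcde']
```

Note that stripping also removes leading and trailing spaces inside real text, so this only fits data where that whitespace is not significant.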

solution1: 2 ACCEPTED 2015-01-05 19:02:55
solution2: 1 2015-01-05 19:06:36
