In Python - Parsing an XML response and finding a specific text value

Question

I'm new to Python and I'm having a particularly difficult time working with XML. I'm trying to count the number of times a word appears in an XML document. Simple enough, but the XML document is a response from a server. Is it possible to do this without writing the response to a file? It would be great to do it in memory.

Here is a sample XML document:

<xml>
  <title>Info</title>
    <foo>aldfj</foo>
      <data>Text I want to count</data>
</xml>

Here is what I have in Python:

import urllib2
import StringIO
from xml.dom import minidom
from xml.etree.ElementTree import parse

usock = urllib2.urlopen('http://www.example.com/file.xml')
xmldoc = minidom.parse(usock)
print xmldoc.toxml()

Past this point I have tried using StringIO, ElementTree, and minidom without success, and I'm not sure what else to do.

Any help would be greatly appreciated.

Answer 1

It's quite simple, as far as I can tell:

import urllib2
from xml.dom import minidom

usock = urllib2.urlopen('http://www.example.com/file.xml') 
xmldoc = minidom.parse(usock)

for element in xmldoc.getElementsByTagName('data'):
  print element.firstChild.nodeValue

So to count the occurrences of a string, try this (a bit condensed, but I like one-liners):

count = sum(element.firstChild.nodeValue.count('substring') for element in xmldoc.getElementsByTagName('data'))
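
For reference, here is a minimal sketch of the same idea written for Python 3, parsing from an in-memory string with minidom.parseString instead of a live URL. The sample document is the one from the question; the function name count_in_data and its parameters are made up for illustration and are not part of any library.

```python
from xml.dom import minidom

# Sample document from the question, held in memory as a string.
SAMPLE = """<xml>
  <title>Info</title>
  <foo>aldfj</foo>
  <data>Text I want to count</data>
</xml>"""


def count_in_data(xml_text, needle, tag="data"):
    """Count occurrences of needle in the text of every <tag> element.

    Hypothetical helper: only the first child of each element is inspected,
    which mirrors the firstChild.nodeValue approach above.
    """
    doc = minidom.parseString(xml_text)
    total = 0
    for element in doc.getElementsByTagName(tag):
        child = element.firstChild
        # An empty element such as <data/> has no firstChild, so guard for None.
        if child is not None and child.nodeType == child.TEXT_NODE:
            total += child.nodeValue.count(needle)
    return total


print(count_in_data(SAMPLE, "count"))
```

The None check matters because element.firstChild.nodeValue raises AttributeError on an empty element.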

Answer 2

If you are just trying to count the number of times a word appears in an XML document, just read the document as a string and do a count:

import urllib2
data = urllib2.urlopen('http://www.example.com/file.xml').read()
print data.count('foobar')

Otherwise, you can just iterate through the tags you are looking for:

from xml.etree import cElementTree as ET
xml = ET.fromstring(urllib2.urlopen('http://www.example.com/file.xml').read())
for data in xml.iter('data'):
    # data.text holds the text content of each <data> element
    print data.text
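
The two approaches can give different numbers, because counting on the raw response also matches tag names. A small Python 3 sketch of the difference follows; the body value is an invented stand-in for what urlopen(...).read() would return, not the real server response.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for urlopen(...).read(), which returns bytes in Python 3.
body = b"<xml><title>data</title><data>some data here</data></xml>"

# Raw count: matches the word everywhere, including the <data> and </data> tags.
raw_count = body.count(b"data")

# Parsed count: only looks at the text inside <data> elements.
root = ET.fromstring(body)
text_count = sum((el.text or "").count("data") for el in root.iter("data"))

print(raw_count, text_count)
```

Here the raw count is 4 (title text, opening tag, element text, closing tag) while the parsed count is 1. Neither version writes anything to disk.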

Answer 3

Does this help?

from xml.etree.ElementTree import XML

txt = """<xml>
           <title>Info</title>
           <foo>aldfj</foo>
           <data>Text I want to count</data>
         </xml>"""

# this will give us the contents of the data tag.
data = XML(txt).find("data").text

# ... so here we could do whatever we want
print data
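
Note that find() returns only the first matching child, and returns None when nothing matches. If the response may contain several data elements, something along these lines (a Python 3 sketch; the two-element sample is invented for illustration) handles both cases:

```python
from xml.etree.ElementTree import XML

# Invented sample with two <data> elements.
txt = """<xml>
  <title>Info</title>
  <data>first count</data>
  <data>second count, count again</data>
</xml>"""

root = XML(txt)

# find() returns the first matching child, or None if there is none.
first = root.find("data")
print(first.text if first is not None else "no <data> element")

# findall() returns every direct child named "data".
total = sum((el.text or "").count("count") for el in root.findall("data"))
print(total)
```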

Answer 4

Just replace the string 'count' with whatever word you want to count. This code counts whole words, so you'll have to adapt it if you want to count phrases. Either way, the way to get at all the embedded text is XML('<your xml string here>').itertext().

from xml.etree.ElementTree import XML
from re import findall

txt = """<xml>
        <title>Info</title>
        <foo>aldfj</foo>
        <data>Text I want to count</data>
    </xml>"""

print sum(len(filter(lambda w: w == 'count', findall(r'\w+', t))) for t in XML(txt).itertext())
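
If you want counts for several words at once, one possible variation (a Python 3 sketch; the helper name word_counts is made up) is to collect every word from itertext() into a collections.Counter:

```python
import re
from collections import Counter
from xml.etree.ElementTree import XML

txt = """<xml>
        <title>Info</title>
        <foo>aldfj</foo>
        <data>Text I want to count</data>
    </xml>"""


def word_counts(xml_text):
    """Return a Counter of every word found in the document's text nodes."""
    counts = Counter()
    for chunk in XML(xml_text).itertext():
        # \w+ splits on whitespace and punctuation; matching is case-sensitive.
        counts.update(re.findall(r"\w+", chunk))
    return counts


counts = word_counts(txt)
print(counts["count"])
```

Because this matches whole words, "counting" would not be counted as "count", unlike the substring-based str.count().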
