How to get value from XML Tag in Python?

I have an XML file like the one below.

<?xml version="1.0" encoding="UTF-8"?><searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
    <snippet>
      this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
    <snippet>
            this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   ........ #and other documents .....

  <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
  <group id="0" size="298" score="55">
      <title>
         <phrase>GROUP B</phrase>
      </title>
      <document refid="2"/>
      <document refid="13"/>
      <document refid="3"/>
   </group>
</searching>

I want to get the name of each group above, along with the IDs (and titles) of the documents in each group. My idea is to store each document ID and title in a dictionary, like this:

import codecs
documentID = {}    
group = {}

myfile = codecs.open("file.xml", mode = 'r', encoding = "utf8")
for line in myfile:
    line = line.strip()
    #get id from tags
    #get title from tag
    #store in documentID 


    #get group name and document reference

I have also tried BeautifulSoup, but I am very new to it and don't know how to proceed. This is the code I have so far:

def outputCluster(rFile):
    documentInReadFile = {}         #dictionary to store all document in readFile

    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in readFile:
    # print soup.prettify()

    # print soup.find_all('title')

outputCluster("file.xml")

Any suggestions would be appreciated. Thank you.

Did you have a look at Python's XML etree parser? There are plenty of examples on the web.

The previous posters have the right of it. The etree documentation can be found here:

https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

It should help you out. Here's a code sample that might do the trick (partially adapted from the above link):

import xml.etree.ElementTree as ET
tree = ET.parse('your_file.xml')
root = tree.getroot()

for group in root.findall('group'):
  title = group.find('title')
  titlephrase = title.find('phrase').text
  for doc in group.findall('document'):
    refid = doc.get('refid')

Or, if you want the ID stored in the group tag itself, you'd use id = group.get('id') instead of searching for all the refids.
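To cover the whole question (each group's name plus the ID and title of every document it references), the loop above can be combined with a lookup table built from the top-level document elements. The following is only a minimal sketch, assuming the file has the structure shown in the question. The inline XML_SAMPLE string and the helper name groups_with_titles are illustrative and not part of the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the question's file (assumption: same structure).
XML_SAMPLE = """<searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
      <url>http://www.google.com/</url>
   </document>
   <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
</searching>"""


def groups_with_titles(root):
    """Return {group name: [(refid, title or None), ...]} (hypothetical helper)."""
    # findall('document') only matches direct children of the root,
    # so the <document refid=.../> entries inside groups are not included here.
    titles = {
        doc.get("id"): doc.findtext("title", default="").strip()
        for doc in root.findall("document")
    }
    result = {}
    for group in root.findall("group"):
        name = group.findtext("title/phrase", default="").strip()
        refids = [doc.get("refid") for doc in group.findall("document")]
        # A refid with no matching top-level document gets None as its title.
        result[name] = [(refid, titles.get(refid)) for refid in refids]
    return result


root = ET.fromstring(XML_SAMPLE)  # for a file: ET.parse('your_file.xml').getroot()
groups = groups_with_titles(root)
for name, docs in groups.items():
    print(name, docs)
```

The result is keyed by group name rather than by the id attribute because both groups in the question carry id="0". If two groups could share a name, a list of (name, documents) pairs would be a safer container.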

ElementTree is brilliant for looking through XML. The docs show how to manipulate XML in many ways, including how to get the contents of a tag. An example from the docs is:
XML:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Code:

>>> for country in root.findall('country'):
...   rank = country.find('rank').text
...   name = country.get('name')
...   print name, rank
...
Liechtenstein 1
Singapore 4
Panama 68

Which you could manipulate easily enough to do what you want.
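The session above is written for Python 2 and assumes root has already been parsed. For reference, here is a self-contained Python 3 sketch of the same idea. The COUNTRY_XML string is a shortened copy of the data above, and the ranks dictionary is added purely for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the country data shown above.
COUNTRY_XML = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
    </country>
    <country name="Singapore">
        <rank>4</rank>
    </country>
    <country name="Panama">
        <rank>68</rank>
    </country>
</data>"""

root = ET.fromstring(COUNTRY_XML)

ranks = {}
for country in root.findall("country"):
    name = country.get("name")         # attribute value
    rank = country.find("rank").text   # text content of the child tag
    ranks[name] = rank
    print(name, rank)
```

Note that .text returns a string, so the rank would need int(rank) before any numeric comparison.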

BeautifulSoup is nice to use, though a bit surprising at first.

soup = BeautifulSoup(myfile)

soup now holds the whole file, and you have to search through it to find the part you need. For instance:

group = soup.find(name="group", attrs={'id': '0', 'size': '298'})

group now contains the group tag and its contents (the first matching group it found):

<group>blabla its contents<tag inside it>blabla</tag inside it>etc.</group>

Do this a number of times to get down to the innermost tags; the more specific each search is, the lower the chance of landing on the wrong tag. Then

lastthingyoufound.find(name='phrase')

will contain your answer. The result still includes the surrounding tags, so you need another call to extract the text, and which one depends on your BeautifulSoup version (for example, .string, or get_text() in BeautifulSoup 4). Use find_all (findAll in BeautifulSoup 3) to get lists you can iterate over when you need multiple elements. Feel free to keep references to the intermediate tags so you can pull other information from them later, rather than doing soup = soup.find(...). Reassigning like that means you are only looking for one specific thing and lose the tags in between; it is equivalent to doing soup = soup.find(...).find(...).find_all(...)[-1].find(...)['id'], for instance.
