
Finding text in namespaced XML elements with lxml.etree

I am trying to use lxml.etree to parse an XML file and find text inside its elements.

XML files can be as such:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
     http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2002-06-01T19:20:30Z</responseDate> 
 <request verb="ListRecords" from="1998-01-15"
      set="physics:hep"
      metadataPrefix="oai_rfc1807">
      http://an.oa.org/OAI-script</request>
 <ListRecords>
  <record>
    <header>
      <identifier>oai:arXiv.org:hep-th/9901001</identifier>
      <datestamp>1999-12-25</datestamp>
      <setSpec>physics:hep</setSpec>
      <setSpec>math</setSpec>
    </header>
    <metadata>
     <rfc1807 xmlns=
    "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation=
       "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
    http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
    <bib-version>v2</bib-version>
    <id>hep-th/9901001</id>
    <entry>January 1, 1999</entry>
    <title>Investigations of Radioactivity</title>
    <author>Ernest Rutherford</author>
    <date>March 30, 1999</date>
     </rfc1807>
    </metadata>
    <about>
      <oai_dc:dc 
      xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ 
      http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:publisher>Los Alamos arXiv</dc:publisher>
    <dc:rights>Metadata may be used without restrictions as long as 
       the oai identifier remains attached to it.</dc:rights>
      </oai_dc:dc>
    </about>
  </record>
  <record>
    <header status="deleted">
      <identifier>oai:arXiv.org:hep-th/9901007</identifier>
      <datestamp>1999-12-21</datestamp>
    </header>
  </record>
 </ListRecords>
</OAI-PMH>

For the following part, assume doc = etree.parse("/tmp/test.xml"), where test.xml contains the XML pasted above.

First, I try to find all the <record> elements using doc.findall(".//record"), but it returns an empty list.

Secondly, for a given word, I'd like to check whether it appears in <dc:publisher>. To do this I first tried the same approach as before, doc.findall(".//publisher"), but I have the same issue. I'm pretty sure all of this is linked to namespaces, but I don't know how to handle them.

I've read the libxml tutorial and tried its example for the findall method on a basic XML file (without any namespace), and it worked.

As Chris has already mentioned, you can also use lxml and xpath. As xpath doesn't allow you to write the namespaced names in full like {http://www.openarchives.org/OAI/2.0/}record (so-called "James Clark notation" *), you will have to use prefixes, and provide the xpath engine with a prefix-to-namespace-uri mapping.

Example with lxml (assuming you already have the desired tree object):

nsmap = {'oa':'http://www.openarchives.org/OAI/2.0/', 
         'dc':'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
            namespaces=nsmap)

This will select all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant element {http://purl.org/dc/elements/1.1/}publisher containing the word "Alamos".

[*] This comes from an article in which James Clark explains XML Namespaces. Anyone not familiar with namespaces should read it, even though it was written a long time ago.

Disclaimer: I am using the standard library xml.etree.ElementTree module, not the lxml library (although, as far as I know, lxml.etree provides a largely compatible API, so this should carry over). I'm sure there is a much simpler answer using lxml and XPath, but I don't know it.

Namespace issue

You were right to say that the problem is likely the namespaces. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record tags in the file. As the following demonstrates:

>>> import xml.etree.ElementTree as etree

>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)

# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>

# Let's see what children there are of the root element
>>> for child in e:
...     print child
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print child
... 
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>

So, for example

>>> e.find('ListRecords')

returns None , whereas

>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

returns the ListRecords element.

Note that I am using the find method since the standard library ElementTree does not have an xpath method.
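The `.//` descendant search from the question also works once the tag is written in its fully qualified form. A minimal sketch, assuming a trimmed-down stand-in for the XML above (the names `SAMPLE` and `OAI` are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the document in the question.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:arXiv.org:hep-th/9901001</identifier></header></record>
    <record><header status="deleted"><identifier>oai:arXiv.org:hep-th/9901007</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # namespace URI wrapped in braces

root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, as in the question.
plain = root.findall(".//record")

# With the fully qualified tag, both records are found at any depth.
qualified = root.findall(".//" + OAI + "record")

# Every step of a path needs the namespace, not just the last one.
identifiers = [r.find(OAI + "header/" + OAI + "identifier").text
               for r in qualified]
```

Here `plain` is empty while `qualified` holds both record elements, which is the same behaviour the question describes for `doc.findall(".//record")`.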

Possible solution

One way to solve this is to get the namespace (the URI wrapped in braces) and prepend it to the tag you are trying to find. You can use

>>> namespace = e.tag[:e.tag.index('}')+1]
>>> namespace
'{http://www.openarchives.org/OAI/2.0/}'

on the root element, e , to find the namespace, although I'm sure there is a better way of doing this.

Now we can define functions to extract the tags we want, with an optional namespace prefix:

def findallNS(element, tag, namespace=None):
    # Prepend the '{uri}' namespace string to the tag when one is given
    if namespace is not None:
        return element.findall(namespace+tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    # Same as findallNS, but returns only the first match (or None)
    if namespace is not None:
        return element.find(namespace+tag)
    else:
        return element.find(tag)

So now we can write:

>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>, 
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]

Alternative solution

Another solution may be to write a function that searches for all tags ending with the tag you are interested in, for example:

def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]

Here you don't need to deal with the namespace at all.
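One caveat with `endswith` is that it also matches tags that merely end in the same letters, such as `subrecord` when searching for `record`. A stricter variant can compare the local name exactly. This is only a sketch: `local_name` and `find_child_tags_exact` are hypothetical helper names, and the `http://example.com/ns` namespace is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Return the tag without its '{namespace}' part, if it has one."""
    return tag.rsplit('}', 1)[-1]

def find_child_tags_exact(element, tag):
    """Direct children whose local name equals `tag`, in any namespace."""
    return [child for child in element if local_name(child.tag) == tag]

# Tiny made-up document: two <record> children and one <subrecord>.
root = ET.fromstring(
    '<root xmlns="http://example.com/ns">'
    '<record/><subrecord/><record/>'
    '</root>'
)

# The endswith approach also picks up <subrecord>.
loose = [child for child in root if child.tag.endswith('record')]

# The exact comparison only returns the two <record> children.
exact = find_child_tags_exact(root, 'record')
```

Like the original function, this only looks at direct children and ignores which namespace a tag belongs to, so it trades precision for convenience.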

@Chris's answer is very good and it will work with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):

In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>
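The same `namespaces` mapping is also accepted by `find` and `findall` in the standard library `xml.etree.ElementTree`, so the second part of the question (checking whether a word appears in `<dc:publisher>`) can be sketched without `xpath`. The sample below is a trimmed-down stand-in for the XML in the question, and `records_with_publisher` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the document in the question.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
        </oai_dc:dc>
      </about>
    </record>
    <record><header status="deleted"/></record>
  </ListRecords>
</OAI-PMH>"""

# Prefixes here are local to the query; they need not match the document's.
NS = {'oa': 'http://www.openarchives.org/OAI/2.0/',
      'dc': 'http://purl.org/dc/elements/1.1/'}

def records_with_publisher(root, word):
    """Return the records that have a dc:publisher whose text contains `word`."""
    matches = []
    for record in root.findall('.//oa:record', NS):
        publishers = record.findall('.//dc:publisher', NS)
        # p.text can be None for an empty element, hence the `or ''`.
        if any(word in (p.text or '') for p in publishers):
            matches.append(record)
    return matches

root = ET.fromstring(SAMPLE)
hits = records_with_publisher(root, 'Alamos')
misses = records_with_publisher(root, 'CERN')
```

The filtering is done in Python rather than with an XPath `contains()` predicate, because ElementTree only supports a limited XPath subset.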
