Lxml element equality with namespaces

I am trying to use lxml to parse the contents of a .docx document. I understand that lxml replaces namespace prefixes with the actual namespace URI, but this makes it a real pain to check what kind of element I am working with. I would like to be able to do something like

if (someElement.tag == "w:p"):

but since lxml insists on prepending the full namespace, I'd either have to do something like

if (someElement.tag == "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):

or perform a lookup of the full namespace name from the element's nsmap attribute like this

targetTag = "{%s}p" % someElement.nsmap['w']
if (someElement.tag == targetTag):

Is there an easier way to convince lxml to either

  1. give me the tag string without the namespace prepended to it (I could then use the prefix attribute along with it to check which tag I'm working with), or
  2. just give me the tag string using the prefix?

Either would save a lot of keystrokes when writing this parser. Is this possible? Am I missing something in the documentation?

Perhaps use local-name():

import lxml.etree as ET
tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
elt=tree[0]
print(elt.xpath('local-name()'))
# test
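If the namespace matters as well as the local name, XPath's namespace-uri() function can be paired with local-name(). The following is only a small sketch on an invented two-child document (the plain element is an assumption added for contrast):

```python
import lxml.etree as ET

# Invented sample: one namespaced child and one child without a namespace.
tree = ET.fromstring('<root xmlns:f="foo"><f:test/><plain/></root>')

# For each child, ask XPath for its namespace URI and its local name.
names = [
    (str(el.xpath('namespace-uri()')), str(el.xpath('local-name()')))
    for el in tree
]
print(names)  # [('foo', 'test'), ('', 'plain')]
```

Note that an element without a namespace yields an empty string here, not None.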

etree.QName should be able to get you what you want.

from lxml import etree

# [...]

tag = etree.QName(someElement)

print(tag.namespace, tag.localname)

For your example tag, this will output:

http://schemas.openxmlformats.org/wordprocessingml/2006/main p

Note that QName will take either the Element object or a string (such as from Element.tag).

And, as you note, you can also use Element.nsmap to map from an arbitrary prefix to a namespace.

So something like this:

if tag.namespace == someElement.nsmap["w"] and tag.localname == "p":
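Put together, the check might look like the sketch below. The XML fragment is a made-up stand-in for a real .docx part, and is_tag is a hypothetical helper name, not something lxml provides:

```python
from lxml import etree

# Assumed sample: a tiny WordprocessingML-like fragment, not a real .docx part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xml = (
    '<w:body xmlns:w="%s">'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:tbl/>'
    '</w:body>' % W_NS
)
body = etree.fromstring(xml)


def is_tag(elem, prefix, localname):
    """Hypothetical helper: True if elem is prefix:localname under its own nsmap."""
    namespace = elem.nsmap.get(prefix)
    if namespace is None:
        # The prefix is not declared for this element, so nothing can match.
        return False
    qname = etree.QName(elem)
    return qname.namespace == namespace and qname.localname == localname


# Check each direct child of w:body against "w:p".
paragraph_flags = [is_tag(child, "w", "p") for child in body]
print(paragraph_flags)  # [True, False]
```

This relies on the document actually binding the prefix w to the WordprocessingML namespace; a file that uses a different prefix for the same URI would need the URI compared directly.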

I could not find a way to obtain the non-namespaced tag name from an element; lxml considers the full namespace part of the tag name. Here are a few options which may help.

You could also use the QName class to construct a namespaced tag for comparisons:

import lxml.etree
from lxml.etree import QName

tree = lxml.etree.fromstring('<root xmlns:f="foo"><f:test/></root>')
qn = QName(tree.nsmap['f'], 'test')
assert tree[0].tag == qn

If you need the bare tag name you'll have to write a utility function to extract it:

def get_bare_tag(elem):
    return elem.tag.rsplit('}', 1)[-1]

assert get_bare_tag(tree[0]) == 'test'

Unfortunately, to my knowledge you can't search for tags with "any namespace" (e.g. {*}test) using lxml's xpath / find methods, though this may no longer hold for newer lxml releases.
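For what it's worth, recent lxml releases do appear to accept the {*}test wildcard in find(), findall() and iter(), and plain XPath can get the same effect through local-name(). A minimal sketch, assuming a reasonably current lxml and an invented document with two namespaces:

```python
from lxml import etree

# Invented sample with the same local name in two different namespaces.
root = etree.fromstring(
    '<root xmlns:f="foo" xmlns:g="bar"><f:test/><g:test/><f:other/></root>'
)

# ElementPath-style wildcard: any namespace, local name "test".
wild_tags = [el.tag for el in root.findall("{*}test")]

# XPath 1.0 has no such wildcard syntax, but local-name() selects the same nodes.
xpath_tags = [el.tag for el in root.xpath("*[local-name() = 'test']")]

print(wild_tags)   # ['{foo}test', '{bar}test']
print(xpath_tags)  # ['{foo}test', '{bar}test']
```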

Updated: Note that lxml won't construct a tag that contains an unbalanced { or }; it raises ValueError: Invalid tag name. So it is safe to assume that an element whose tag name starts with { is balanced.

lxml.etree.Element('{foo')
ValueError: Invalid tag name

To save time when looking for high-volume tags like p (paragraph, I presume) in docx or c (cell) in xlsx, it's usual to set up the full tag once at the global or class level:

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_p = WPML_URI + 'p'
tag_t = WPML_URI + 't'
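As a rough illustration of how such constants might be used, the sketch below walks an invented WordprocessingML-like fragment and joins the text runs of each paragraph. The sample XML and the upper-case constant names are assumptions for the example:

```python
from lxml import etree

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TAG_P = WPML_URI + "p"
TAG_T = WPML_URI + "t"

# Assumed sample: two paragraphs, the first split across two runs.
xml = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second</w:t></w:r></w:p>'
    '</w:body>'
)
body = etree.fromstring(xml)

paragraph_texts = []
for p in body.iter(TAG_P):
    # Concatenate every w:t descendant of this paragraph.
    paragraph_texts.append("".join(t.text or "" for t in p.iter(TAG_T)))

print(paragraph_texts)  # ['Hello world', 'Second']
```

Because the constants are plain strings built once, each comparison inside iter() is just a string match with no per-element nsmap lookup.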

I have never seen an explanation of why one would want to use QName().

In the other direction, given a full tag, you can extract the base tag easily:

base_tag = full_tag.rsplit("}", 1)[-1]

I'm no Python expert, but I also had this problem (Windows 7 "Contacts" files). I wrote the following function for the lxml system.

This function takes an element and returns its tag with the namespace URI replaced by the matching prefix from the element's nsmap.

from lxml import etree

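def denstag(ee):
    """Return ee.tag with its '{uri}' part replaced by the matching 'prefix:'."""
    tag = ee.tag
    for prefix, uri in ee.nsmap.items():
        braced = "{" + uri + "}"
        if tag.startswith(braced):
            local = tag[len(braced):]
            # The default namespace is stored under the key None: no prefix to add.
            return local if prefix is None else prefix + ":" + local
    return tag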

Here is my solution for restoring the real (source) XML tag name.

Assume we have an xml_node variable, an instance of an lxml Element.

Before: {http://some/namespace/url}TagName (as read from the xml_node.tag property)

After: nsprefix:TagName (as the result of xml_get_real_tag_name(xml_node))

from lxml import etree

def xml_get_real_tag_name(xml_node):
    """Replace lxml '{http://some/namespace/url}TagName' with regular 'nsprefix:TagName' string
    Args:
        xml_node (lxml.etree.Element) Source xml node entity
    Returns:
        str
    """
    localname = etree.QName(xml_node).localname
    # prefix is None for un-namespaced tags and for the default namespace
    if xml_node.prefix is None:
        return localname
    return ':'.join([xml_node.prefix, localname])
