I am attempting to use Lxml to parse the contents of a .docx document. I understand that lxml replaces namespace prefixes with the actual namespace, however this makes it a real pain to check what kind of element tag I am working with. I would like to be able to do something like
if (someElement.tag == "w:p"):
but since lxml insists on prepending te ful namespace I'd either have to do something like
if (someElemenet.tag == "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p'):
or perform a lookup of the full namespace name from the element's nsmap attribute like this
targetTag = "{%s}p" % someElement.nsmap['w']
if (someElement.tag == targetTag):
If there were was an easier way to convince lxml to either
This would save a lot of keystrokes when writing this parser. Is this possible? Am I missing something in the documentation?
Perhaps use local-name() :
import lxml.etree as ET
tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
elt=tree[0]
print(elt.xpath('local-name()'))
# test
etree.Qname
should be able to get you what you want.
from lxml import etree
# [...]
tag = etree.QName(someElement)
print(tag.namespace, tag.localname)
For your example tag, this will output:
http://schemas.openxmlformats.org/wordprocessingml/2006/main p
Note that QName
will take either the Element
object or a string (such as from Element.tag
).
And, as you note, you can also use Element.nsmap
to map from an arbitrary prefix to a namespace.
So something like this:
if tag.namespace == someElement.nsmap["w"] and tag.localname == "p":
I could not find a way to obtain the non-namespaced tag name from an element -- lxml considers the full namespace part of the tag name. Here are a few options which may help..
You could also use the QName
class to construct a namespaced tag for comparisons:
import lxml.etree
from lxml.etree import QName
tree = lxml.etree.fromstring('<root xmlns:f="foo"><f:test/></root>')
qn = QName(tree.nsmap['f'], 'test')
assert tree[0].tag == qn
If you need the bare tag name you'll have to write a utility function to extract it:
def get_bare_tag(elem):
return elem.tag.rsplit('}', 1)[-1]
assert get_bare_tag(tree[0]) == 'test'
Unfortunately, to my knowledge you can't search for tags with "any namespace" (eg {*}test
) using lxml's xpath / find methods.
Updated : Note that lxml won't construct a tag that contains only { or }
-- it will raise ValueError: invalid tag name, so it is safe to assume that an element whose tag name starts with {
is balanced.
lxml.etree.Element('{foo')
ValueError: Invalid tag name
To save time when looking for high-volume tags like p
(paragraph, I presume) in docx or c
(cell) in xlsx, it's usual to set up the full tag once at the global or class level:
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_p = WPML_URI + 'p'
tag_t = WPML_URI + 't'
I have never seen an explanation of why one would want to use QName()
.
In the other direction, given a full tag, you can extract the base tag easily:
base_tag = full_tag.rsplit("}", 1)[-1]
I'm no Python expert, but I also had this problem (Windows 7 "Contacts" files). I wrote the following function for the lxml system.
This function takes an element, and returns its tag with the prefix substituted from the file's ns tag.
from lxml import etree
def denstag(ee):
tag = ee.tag
for ns in ee.nsmap:
prefix = "{"+ee.nsmap[ns]+"}"
if tag.startswith(prefix):
return ns+":"+tag[len(prefix):]
return tag
Here is my solution for restoring real (source) xml tag name
Assuming we have xml_node
variable, an instance of lxml Element
Before: {http://some/namespace/url}TagName
(as read from xml_node.tag
prop)
After: nsprefix:TagName
(as result of xml_get_real_tag_name(xml_node)
)
def xml_get_real_tag_name(xml_node):
"""Replace lxml '{http://some/namespace/url}TagName' with regular 'nsprefix:TagName' string
Args:
xml_node (lxml.etree.Element) Source xml node entity
Returns:
str
"""
if '{' in xml_node.tag:
return ':'.join([xml_node.prefix, etree.QName(xml_node).localname])
else:
return xml_node.tag
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.