Find element that has unknown namespace in lxml

I have an XML document with many levels. Each level may have a namespace attached to it. I want to find a specific element whose name I know, but whose namespace I do not. For example:

my_file.xml

<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <continent>Asia</continent>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>

import lxml.etree as etree

tree = etree.parse('my_file.xml')
root = tree.getroot()

cntry_node = root.find('.//country')

The find above returns nothing (cntry_node is None). In my real data, the levels are deeper than in this example. The lxml documentation talks about namespaces. When I do this:

root.nsmap

I see this:

{None: 'aaa:bbb:ccc:ddd:eee'}

Could someone explain how to access the full nsmap and/or how to use it to find a specific element? Thanks very much.

You could declare all namespaces, but given the structure of your sample XML, I would argue you are better off disregarding namespaces altogether and just using local-name(); so

cntry_node = root.xpath('.//*[local-name()="country"]')
cntry_node

returns

[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x1cddf1d4680>,
 <Element {aaa:bbb:ccc:singapore:eee}country at 0x1cddf1d47c0>,
 <Element {aaa:bbb:ccc:panama:eee}country at 0x1cddf1d45c0>]
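
To pick out one particular country, the same local-name() test can be combined with an ordinary attribute predicate, since the name attribute carries no namespace. A minimal sketch, assuming a shortened inline copy of the sample document (the SAMPLE constant and the variable names are illustrative and not part of the original post):

```python
import lxml.etree as etree

# Shortened inline copy of my_file.xml (assumption: same namespace layout).
SAMPLE = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(SAMPLE)

# Match on the local name only, then narrow down by the un-namespaced attribute.
matches = root.xpath('.//*[local-name()="country"][@name="Singapore"]')
singapore = matches[0]

# The child inherits the country's default namespace, so test local-name() again.
rank = singapore.xpath('./*[local-name()="rank"]')[0]
print(singapore.tag, rank.text)
```

Because every child of a country element inherits that country's default namespace, each step of the path needs its own local-name() test.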

nsmap is not a global collection of all namespaces of an XML document

I believe your impression was that nsmap is a collection of all namespaces present in an XML document, and that this collection would be available after parsing the document. That is not the case.

nsmap gives you access to the namespace definitions of one element only. So this:

root = tree.getroot()
root.nsmap

gives you the namespace definitions known in the context of the root element. Keep in mind that "root" is just the name of a Python variable; it holds the outermost element of your XML document (I know this because you called getroot()). The outermost element of your document is:

<data xmlns="aaa:bbb:ccc:ddd:eee">

so it is expected that its nsmap would contain

{None: 'aaa:bbb:ccc:ddd:eee'}

(The key in the nsmap is None because this is a default namespace, declared without a prefix. If there were a prefix, it would appear where the None is.)
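
Since nsmap is per element, one way to see every namespace used in a document is to walk all elements and gather the URIs from each element's nsmap. This is a rough sketch, assuming a shortened inline copy of the sample document (SAMPLE and all_namespaces are illustrative names):

```python
import lxml.etree as etree

# Shortened inline copy of my_file.xml (assumption: same namespace layout).
SAMPLE = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(SAMPLE)

# nsmap only describes one element, so collect the URIs seen on every element.
all_namespaces = set()
for element in root.iter('*'):  # '*' restricts the walk to elements
    all_namespaces.update(element.nsmap.values())

print(sorted(all_namespaces))
```

Note that this yields a set of URIs rather than a single prefix mapping: in this document every namespace is a default namespace, so they would all compete for the same None key.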

The XML document has a terrible structure

Usually, the best way to deal with namespaces is to define them yourself (without taking them from the input document). Suppose we would like to find the following element:

<country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">

This country element is in a default namespace with the namespace URI "aaa:bbb:ccc:liechtenstein:eee". To find it with lxml, define a mapping:

my_own_namespace_mapping = {'prefix': 'aaa:bbb:ccc:liechtenstein:eee'}

and then use it when retrieving nodes:

root.xpath('.//prefix:country', namespaces=my_own_namespace_mapping)
[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x7fea87f363f8>]

However, in the case of your input document, it appears you would need to do that separately for each country element, because each one is in its own default namespace:

root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:singapore:eee'})
[<Element {aaa:bbb:ccc:singapore:eee}country at 0x7fea879cfd40>]

and so on. That is very impractical, not because lxml or namespaces are complicated, but because someone designed this XML format badly.

By the way, once you have found one of those elements, you can use nsmap again to check that what I say above is true:

root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:liechtenstein:eee'})[0].nsmap
{None: 'aaa:bbb:ccc:liechtenstein:eee'}
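
If repeating the query once per namespace is too tedious, one possible workaround is to iterate over all elements and compare only the local part of each tag, using lxml's etree.QName helper. This is a sketch under the assumption of a shortened inline copy of the sample document (SAMPLE and countries are illustrative names):

```python
import lxml.etree as etree

# Shortened inline copy of my_file.xml (assumption: same namespace layout).
SAMPLE = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(SAMPLE)

# QName splits '{uri}local' tags into .namespace and .localname.
countries = []
for element in root.iter('*'):
    qname = etree.QName(element)
    if qname.localname == 'country':
        countries.append((element.get('name'), qname.namespace))

for name, namespace in countries:
    print(name, namespace)
```

This also records which namespace each match came from, in case that is needed to build a prefix mapping afterwards.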

Another option is to use {*} as the namespace wildcard:

cntry_node = root.find('.//{*}country')

Note: This only works with find(), findall(), iter(), etc.; not xpath().

See here for more details.
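
For illustration, the wildcard can be tried with find(), findall() and iter() side by side. This is a minimal sketch, assuming a reasonably recent lxml release and a shortened inline copy of the sample document (SAMPLE and the variable names are illustrative):

```python
import lxml.etree as etree

# Shortened inline copy of my_file.xml (assumption: same namespace layout).
SAMPLE = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(SAMPLE)

# find() returns the first match in document order, whatever its namespace.
first_country = root.find('.//{*}country')

# findall() returns every match as a list.
all_countries = root.findall('.//{*}country')

# iter() accepts the same wildcard and walks the descendants lazily.
names = [element.get('name') for element in root.iter('{*}country')]

print(first_country.get('name'), len(all_countries), names)
```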
