
Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:

namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

This works with the majority of the feeds I use as input, but occasionally I find a feed with additional namespaces, such as the one below:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
            https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

<url>
  <loc>https://www.example.com/</loc>
  <priority>1.00</priority>
</url>

From what I've read, I would need to add the additional namespace here (xmlns:xsi, I guess) to the namespace dictionary to get my XPath to work with multiple namespaces. However, this is not a long-term solution for me, as I might come across other differing namespaces in the future. Is there a way for me to search for, detect, or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.

Thanks

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.

The difference between your current mapping and the input file is that the default namespace of the XML uses "https" where your mapping uses "http". Namespace URIs are compared as plain strings, so the two do not match and the XPath selects nothing.
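To make the mismatch concrete, here is a minimal sketch that runs the question's XPath against a trimmed copy of the feed with each spelling of the URI. The inline sample document and the names http_ns, https_ns, http_hits and https_hits are assumptions made for illustration.

```python
from lxml import etree

# Trimmed copy of the feed from the question (default namespace uses "https").
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
</urlset>"""

root = etree.fromstring(xml)
path = "//sm:urlset/sm:url/sm:loc"

# The two mappings differ only by the "s" in the scheme.
http_ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
https_ns = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}

http_hits = [loc.text for loc in root.xpath(path, namespaces=http_ns)]
https_hits = [loc.text for loc in root.xpath(path, namespaces=https_ns)]

print(http_hits)   # no match: the URI strings differ
print(https_hits)  # matches the <loc> element
```

Only the mapping whose URI is character-for-character identical to the one declared in the document selects anything.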

To handle both namespace URIs (or really any other namespace URI that urlset might have), first get the namespace URI of the root element and then use that in your dict mapping...

from lxml import etree

tree = etree.parse("input.xml")

root_ns_uri = tree.xpath("namespace-uri()")

namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

print(data)

prints...

['https://www.example.com/']
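If you need this for many feeds, the same idea can be wrapped in a small helper. The sketch below is one possible shape, not the only way to do it: sitemap_locs is a hypothetical name, it reads the root's namespace with etree.QName instead of the namespace-uri() XPath function, and it assumes urlset is the root element. It also falls back to unprefixed steps when the feed declares no namespace at all, since lxml will not accept an empty URI in the namespaces mapping.

```python
from lxml import etree

def sitemap_locs(root):
    """Collect url/loc text under a urlset root element, whatever namespace it uses."""
    ns_uri = etree.QName(root).namespace  # None if the root is not in a namespace
    if ns_uri is None:
        # No namespace at all: plain, unprefixed steps match.
        return [loc.text for loc in root.xpath("/urlset/url/loc")]
    # Bind the detected URI to our own prefix; the prefix name is arbitrary.
    return [
        loc.text
        for loc in root.xpath("/sm:urlset/sm:url/sm:loc", namespaces={"sm": ns_uri})
    ]

# Hypothetical sample input, trimmed from the feed in the question.
sample = b"""<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
</urlset>"""

locs = sitemap_locs(etree.fromstring(sample))
print(locs)
```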

If urlset isn't always the root element, you may want to do something like this instead...

root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
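The question also asks whether the namespace can simply be ignored. One option, offered here as a sketch and not as part of the answer's approach, is to match on local-name() at every step so that no prefix mapping is needed. LOC_PATH, make_feed, feeds and results are hypothetical names, and the three sample feeds are made up to cover the "http", "https" and no-namespace cases.

```python
from lxml import etree

# Namespace-agnostic path: each step matches on the local element name only.
LOC_PATH = "//*[local-name()='urlset']/*[local-name()='url']/*[local-name()='loc']"

def make_feed(xmlns_attr):
    """Build a tiny sample feed; xmlns_attr is the (possibly empty) xmlns declaration."""
    return (
        "<urlset%s><url><loc>https://www.example.com/</loc></url></urlset>" % xmlns_attr
    ).encode("utf-8")

feeds = [
    make_feed(' xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'),
    make_feed(' xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"'),
    make_feed(""),
]

# The same expression is evaluated against every feed, with no namespaces argument.
results = [
    [loc.text for loc in etree.fromstring(feed).xpath(LOC_PATH)]
    for feed in feeds
]
print(results)
```

The trade-off is strictness: this path accepts urlset, url and loc elements from any namespace, so it may be too permissive if a feed mixes vocabularies that reuse those element names.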
