
Finding namespace URIs for lxml

I'm using lxml to parse XML product feeds with the following code:

namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

This works with the majority of the feeds I use as input, but occasionally I find a feed with additional namespaces, such as the one below:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
            https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

<url>
  <loc>https://www.example.com/</loc>
  <priority>1.00</priority>
</url>

From what I've read, I would need to add the additional namespace (xmlns:xsi, I guess) to the namespace dictionary to get my XPath to work with multiple namespaces. However, this is not a long-term solution for me, as I might come across other namespaces in the future. Is there a way for me to search for, detect, or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.

Thanks

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.

The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
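
To make the mismatch concrete, here is a minimal sketch that runs the question's XPath against an inline copy of the feed with both spellings of the URI. The inline FEED bytes and the variable names are stand-ins chosen for illustration, not part of the original code.

```python
from lxml import etree

# Inline stand-in for the feed in the question (note the "https" namespace URI).
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <priority>1.00</priority>
  </url>
</urlset>"""

root = etree.fromstring(FEED)
XPATH = "//sm:urlset/sm:url/sm:loc"

# Mapping from the question: the URI starts with "http", so nothing matches.
http_map = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
http_hits = [loc.text for loc in root.xpath(XPATH, namespaces=http_map)]

# Same mapping with "https": this is the feed's default namespace, so it matches.
https_map = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}
https_hits = [loc.text for loc in root.xpath(XPATH, namespaces=https_map)]

print(http_hits)
print(https_hits)
```

Namespace URIs are compared as plain strings, so the two spellings count as two unrelated namespaces even though they look like the same address.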

To handle both namespace URIs (or really any other namespace URI that urlset might have), first get the namespace URI of the root element and then use that in your dict mapping...

from lxml import etree

tree = etree.parse("input.xml")

root_ns_uri = tree.xpath("namespace-uri()")

namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

print(data)

prints...

['https://www.example.com/']
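
If you process many feeds, the same idea can be wrapped in a small helper. This is only a sketch: the function name sitemap_locs and the inline TEMPLATE feeds are made up for illustration, and it assumes urlset is the root element and carries a namespace.

```python
from lxml import etree

def sitemap_locs(xml_bytes):
    """Return every urlset/url/loc text, using the root's own namespace URI."""
    root = etree.fromstring(xml_bytes)
    # With the root as context node, namespace-uri() yields the root's namespace URI.
    root_ns_uri = root.xpath("namespace-uri()")
    namespace = {"sm": root_ns_uri}
    return [loc.text for loc in root.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

# Two hypothetical feeds that differ only in the namespace URI.
TEMPLATE = (
    '<urlset xmlns="{ns}">'
    "<url><loc>https://www.example.com/</loc></url>"
    "</urlset>"
)
http_feed = TEMPLATE.format(ns="http://www.sitemaps.org/schemas/sitemap/0.9").encode()
https_feed = TEMPLATE.format(ns="https://www.sitemaps.org/schemas/sitemap/0.9").encode()

print(sitemap_locs(http_feed))
print(sitemap_locs(https_feed))
```

Both calls return the same list, because the prefix mapping is rebuilt from each document rather than hard-coded.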

If urlset isn't always the root element, you may want to do something like this instead...

root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
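
As a rough sketch of that case, the snippet below nests urlset inside a made-up wrapper element and looks up its namespace URI as above. It also shows an alternative that is not part of the answer itself: matching on local-name() at every step, which ignores namespaces altogether. The wrapper element and the variable names are assumptions for illustration.

```python
from lxml import etree

# Hypothetical feed where urlset is not the root element.
FEED = b"""<wrapper>
  <urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://www.example.com/</loc></url>
  </urlset>
</wrapper>"""

root = etree.fromstring(FEED)

# Look up the namespace URI of the first urlset element in document order.
urlset_ns_uri = root.xpath("namespace-uri(//*[local-name()='urlset'])")
data = [
    loc.text
    for loc in root.xpath("//sm:urlset/sm:url/sm:loc", namespaces={"sm": urlset_ns_uri})
]

# Alternative: compare local names only, so no prefix mapping is needed.
AGNOSTIC = "//*[local-name()='urlset']/*[local-name()='url']/*[local-name()='loc']"
data_any_ns = [loc.text for loc in root.xpath(AGNOSTIC)]

print(data)
print(data_any_ns)
```

The local-name() form is more verbose and would also match same-named elements from unrelated namespaces, so it fits best when the element structure is known to be fixed, as it is in the question.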


 