Finding namespace URIs for lxml

Question

I'm using lxml to parse XML product feeds with the following code:

namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

This works with the majority of the feeds I use as input, but occasionally I find a feed with additional namespaces, such as the one below:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="https://www.sitemaps.org/schemas/sitemap/0.9
            https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

<url>
  <loc>https://www.example.com/</loc>
  <priority>1.00</priority>
</url>

From what I've read, I would need to add the additional namespace (xmlns:xsi, I guess) to the namespace dictionary to get my XPath to work with multiple namespaces. However, this is not a long-term solution for me, as I might come across other namespaces in the future. Is there a way to search for, detect, or even delete the namespace? The element tree will always be the same, so my XPath wouldn't change.

Thanks

Answer 1

You shouldn't need to map the xsi prefix; that's only there for the xsi:schemaLocation attribute.

The difference between your current mapping and the input file is that there is an "s" in "https" in the default namespace of the XML.
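
As a quick check of that point, here is a minimal sketch, assuming a trimmed-down copy of the feed held in a string (the XML variable and the http_ns/https_ns names are only for illustration). Namespace URIs are compared as plain strings, so the "http" mapping finds nothing while the "https" mapping matches:

```python
from lxml import etree

# Trimmed-down, illustrative copy of the feed from the question (https namespace).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
</urlset>"""

root = etree.fromstring(XML)
query = "//sm:urlset/sm:url/sm:loc"

# Mapping from the question: "http", which does not equal the document's URI.
http_ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Mapping that matches the document's default namespace exactly.
https_ns = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}

no_match = [loc.text for loc in root.xpath(query, namespaces=http_ns)]
match = [loc.text for loc in root.xpath(query, namespaces=https_ns)]

print(no_match)  # empty: the URI strings differ by one character
print(match)
```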

To handle both namespace URIs (or really any other namespace URI that urlset might have), first get the namespace URI of the root element and then use that in your dict mapping...

from lxml import etree

tree = etree.parse("input.xml")

root_ns_uri = tree.xpath("namespace-uri()")

namespace = {"sm": root_ns_uri}
data = [loc.text for loc in tree.xpath("//sm:urlset/sm:url/sm:loc", namespaces=namespace)]

print(data)

prints...

['https://www.example.com/']
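
To try this without an input file, the same idea can be wrapped in a small function and run against both variants of the feed. This is only a sketch: extract_locs and TEMPLATE are hypothetical names, and the feeds are minimal stand-ins built from a string template rather than real sitemaps.

```python
from lxml import etree

# Minimal stand-in feed; {uri} is filled with the namespace URI under test.
TEMPLATE = (
    '<urlset xmlns="{uri}">'
    "<url><loc>https://www.example.com/</loc></url>"
    "</urlset>"
)

def extract_locs(xml_text):
    """Return all loc values, binding "sm" to whatever namespace the root uses."""
    root = etree.fromstring(xml_text)
    # namespace-uri() with no argument returns the namespace of the context node,
    # which here is the root element.
    ns = {"sm": root.xpath("namespace-uri()")}
    return [loc.text for loc in root.xpath("//sm:urlset/sm:url/sm:loc", namespaces=ns)]

http_feed = TEMPLATE.format(uri="http://www.sitemaps.org/schemas/sitemap/0.9")
https_feed = TEMPLATE.format(uri="https://www.sitemaps.org/schemas/sitemap/0.9")

print(extract_locs(http_feed))
print(extract_locs(https_feed))
```

Note that this sketch assumes the root element is in some namespace; a feed with no xmlns at all would need separate handling.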

If urlset isn't always the root element, you may want to do something like this instead...

root_ns_uri = tree.xpath("namespace-uri(//*[local-name()='urlset'])")
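
If you would rather ignore the namespace altogether, one option is to match every step on local-name() so that no prefix mapping is needed. The following is a sketch under that assumption (locs_any_namespace and the sample feeds are hypothetical); the trade-off is that it would also match same-named elements from any unrelated namespace.

```python
from lxml import etree

# Each step matches on the element's local name only, whatever its namespace.
LOC_QUERY = (
    "//*[local-name()='urlset']"
    "/*[local-name()='url']"
    "/*[local-name()='loc']"
)

def locs_any_namespace(xml_text):
    """Return loc values without needing a namespaces= mapping."""
    root = etree.fromstring(xml_text)
    return [loc.text for loc in root.xpath(LOC_QUERY)]

# Illustrative feeds: https namespace, http namespace, and no namespace at all.
body = "<url><loc>https://www.example.com/</loc></url></urlset>"
feeds = [
    '<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">' + body,
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' + body,
    "<urlset>" + body,
]

results = [locs_any_namespace(feed) for feed in feeds]
print(results)
```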

solution1
1 ACCEPTED 2020-12-21 21:32:05
