
Using lxml XPath to parse an XML file

I'm using lxml XPath to parse the following XML file:

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>
    https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
        <image:image>
            <image:loc>
    https://www.reuters.com/resources/r/?m=02&amp;d=20181126&amp;t=2&amp;i=1328589868&amp;w=&amp;fh=&amp;fw=&amp;ll=460&amp;pl=300&amp;r=LYNXNPEEAO0WM
    </image:loc>
        </image:image>
        <news:news>
            <news:publication>
                <news:name>Reuters</news:name>
                <news:language>eng</news:language>
            </news:publication>
            <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
            <news:title>
    Campbell Soup nears deal with Third Point to end board challenge: sources
    </news:title>
            <news:keywords>Headlines,Business, Industry</news:keywords>
            <news:stock_tickers>NYSE:CPB</news:stock_tickers>
        </news:news>
    </url>
</urlset>

Python code sample

import lxml.etree
import lxml.html
import requests

def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")

    namespace = "http://www.google.com/schemas/sitemap-news/0.9"
    root = lxml.etree.fromstring(r.content)


    records = root.xpath('//news:title', namespaces = {"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)


    records = root.xpath('//sitemap:loc', namespaces = {"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()

Currently, I'm using XPath to get all URLs and all titles, but this is not what I want, because I don't know which URL belongs to which title. My question is: how do I get each <url>, then loop over each <url> as an item to get its corresponding <loc>, <news:keywords>, etc.? Thanks!

Edit: Expected output

foreach <url>
      get <loc>
      get <news:publication_date>
      get <news:title>

Use relative XPath to get from each title to its associated URL:

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)

for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)

    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)

Exercise: Rewrite this to get from the URL to the associated title instead.
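
One possible solution to that exercise (a sketch; it assumes every <url> entry actually contains a <loc> and a <news:title>) is to iterate the <url> elements and step down with relative XPath:

for url in root.xpath('//sitemap:url', namespaces=ns):
    # relative XPath, evaluated from this <url> element
    loc = url.xpath('sitemap:loc', namespaces=ns)[0].text
    title = url.xpath('news:news/news:title', namespaces=ns)[0].text
    print(loc, title)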

Note: The titles (and potentially the URLs as well) seem to be HTML-escaped. Use the unescape() function

from html import unescape

to unescape them.
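
Applied to the title loop above, that could look like this (same root and ns as before):

from html import unescape

for title in root.xpath('//news:title', namespaces=ns):
    print(unescape(title.text))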

The answer is

from datetime import datetime
from html import unescape
from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

for url in root.iterfind("sitemap:url", namespaces=ns):
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)
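
Note that findtext returns None when the element is missing, so the unescape() call would raise a TypeError on a <url> entry without a <news:news> block. If that can happen in the feed, a defensive variant of the title lookup inside the loop (a sketch, not part of the answer above) is:

    title_text = url.findtext("news:news/news:title", namespaces=ns)
    title = unescape(title_text) if title_text is not None else None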

The rules of thumb are:

Try not to use xpath. Use find, findall, or iterfind instead. xpath runs a more complex engine than find, findall, or iterfind, so it takes more time and resources.

Use iterfind instead of findall, because iterfind yields the items lazily, returning one item at a time instead of building a full list, so it uses less memory.

Use findtext if all you need is text.

A more general rule is to read the official documentation.
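
As a quick illustration of those calls (a minimal sketch that reuses root and ns from the answer above; the default="" argument is an addition, not something the answer uses):

for url in root.iterfind("sitemap:url", namespaces=ns):   # lazy: yields one <url> at a time
    news = url.find("news:news", namespaces=ns)           # first matching child element, or None
    if news is None:
        continue
    title = url.findtext("news:news/news:title",
                         default="", namespaces=ns)        # text of the first match, "" if missing
    print(title)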

First, let's create three for-loop functions and compare them.

def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass

function          time
root.iterfind     70.5 µs ± 543 ns
root.findall      72.3 µs ± 839 ns
root.xpath        84.8 µs ± 567 ns

We can see that iterfind is the fastest as expected.

Next, let's check the statements inside the for loop.

statement                                                    time
url.xpath('string(news:news/news:title)', namespaces=ns)    15.7 µs ± 112 ns
url.xpath('news:news/news:title', namespaces=ns)[0].text    14.4 µs ± 53.7 ns
url.find('news:news/news:title', namespaces=ns).text        3.74 µs ± 60 ns
url.findtext('news:news/news:title', namespaces=ns)         3.71 µs ± 40.3 ns

From the above table, we can see that find/findtext is about 4 times faster than xpath, and findtext is slightly faster than find.

This answer takes only 3.41 ms ± 53 µs in total, compared to 8.33 ms ± 52.4 µs for Tomalak's answer.
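
The exact figures will vary with hardware and document size. A rough way to reproduce this kind of timing with the standard timeit module is sketched below; the local file name sitemap_news_index1.xml is a placeholder for a saved copy of the feed:

import timeit

setup = """
from lxml import etree
root = etree.parse("sitemap_news_index1.xml").getroot()   # placeholder: a saved copy of the feed
ns = {"news": "http://www.google.com/schemas/sitemap-news/0.9",
      "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"}
"""

statements = [
    'list(root.iterfind("sitemap:url", namespaces=ns))',   # consume the iterator for a fair comparison
    'root.findall("sitemap:url", namespaces=ns)',
    'root.xpath("sitemap:url", namespaces=ns)',
]

for stmt in statements:
    seconds = timeit.timeit(stmt, setup=setup, number=1000)
    print(f"{stmt}: {seconds / 1000 * 1e6:.1f} µs per call")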
