
Using lxml XPath to parse an XML file

I'm using lxml XPath to parse the following XML file:

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>
    https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
        <image:image>
            <image:loc>
    https://www.reuters.com/resources/r/?m=02&amp;d=20181126&amp;t=2&amp;i=1328589868&amp;w=&amp;fh=&amp;fw=&amp;ll=460&amp;pl=300&amp;r=LYNXNPEEAO0WM
    </image:loc>
        </image:image>
        <news:news>
            <news:publication>
                <news:name>Reuters</news:name>
                <news:language>eng</news:language>
            </news:publication>
            <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
            <news:title>
    Campbell Soup nears deal with Third Point to end board challenge: sources
    </news:title>
            <news:keywords>Headlines,Business, Industry</news:keywords>
            <news:stock_tickers>NYSE:CPB</news:stock_tickers>
        </news:news>
    </url>
</urlset>

Python code sample

import lxml.etree
import lxml.html
import requests

def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")

    namespace = "http://www.google.com/schemas/sitemap-news/0.9"
    root = lxml.etree.fromstring(r.content)


    records = root.xpath('//news:title', namespaces = {"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)


    records = root.xpath('//sitemap:loc', namespaces = {"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()

Currently, I'm using XPath to get all URLs and all titles, but this is not what I want, because I don't know which URL belongs to which title. My question is: how do I get each <url>, then loop over each <url> as an item to get its corresponding <loc>, <news:keywords>, etc.? Thanks!

Edit: Expected output

foreach <url>
      get <loc>
      get <news:publication_date>
      get <news:title>

Use relative XPath to get from each title to its associated URL:

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)

for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)

    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)

Exercise: Rewrite this to get from the URL to the associated title instead.
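
One possible solution to that exercise (a sketch; it assumes every <url> entry actually contains a <loc> and a <news:title>) is to iterate the <url> elements and step down with relative XPath:

for url in root.xpath('//sitemap:url', namespaces=ns):
    # relative XPath, evaluated from this <url> element
    loc = url.xpath('sitemap:loc', namespaces=ns)[0].text
    title = url.xpath('news:news/news:title', namespaces=ns)[0].text
    print(loc, title)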

Note: The titles (and potentially the URLs as well) seem to be HTML-escaped. Use the unescape() function

from html import unescape

to unescape them.
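
Applied to the title loop above, that could look like this (same root and ns as before):

from html import unescape

for title in root.xpath('//news:title', namespaces=ns):
    print(unescape(title.text))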

The answer is

from datetime import datetime
from html import unescape
from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

for url in root.iterfind("sitemap:url", namespaces=ns):
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)
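
Note that findtext returns None when the element is missing, so the unescape() call would raise a TypeError on a <url> entry without a <news:news> block. If that can happen in the feed, a defensive variant of the title lookup inside the loop (a sketch, not part of the answer above) is:

    title_text = url.findtext("news:news/news:title", namespaces=ns)
    title = unescape(title_text) if title_text is not None else None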

The rules of thumb are:

Try not to use xpath. Use find, findall, or iterfind instead. xpath runs a more complex engine than find, findall, or iterfind, so it takes more time and resources.

Use iterfind instead of findall, because iterfind yields the items lazily, returning one item at a time instead of building a full list, so it uses less memory.

Use findtext if all you need is text.

A more general rule is to read the official documentation.
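
As a quick illustration of those calls (a minimal sketch that reuses root and ns from the answer above; the default="" argument is an addition, not something the answer uses):

for url in root.iterfind("sitemap:url", namespaces=ns):   # lazy: yields one <url> at a time
    news = url.find("news:news", namespaces=ns)           # first matching child element, or None
    if news is None:
        continue
    title = url.findtext("news:news/news:title",
                         default="", namespaces=ns)        # text of the first match, "" if missing
    print(title)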

First, let's create three for-loop functions and compare them.

def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass

function          time
root.iterfind     70.5 µs ± 543 ns
root.findall      72.3 µs ± 839 ns
root.xpath        84.8 µs ± 567 ns

We can see that iterfind is the fastest as expected.

Next, let's check the statements inside the for loop.

statement                                                    time
url.xpath('string(news:news/news:title)', namespaces=ns)    15.7 µs ± 112 ns
url.xpath('news:news/news:title', namespaces=ns)[0].text    14.4 µs ± 53.7 ns
url.find('news:news/news:title', namespaces=ns).text        3.74 µs ± 60 ns
url.findtext('news:news/news:title', namespaces=ns)         3.71 µs ± 40.3 ns

From the above table, we can see that find/findtext is about 4 times faster than xpath, and findtext is slightly faster than find.

This answer takes only 3.41 ms ± 53 µs in total, compared to 8.33 ms ± 52.4 µs for Tomalak's answer.
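
The exact figures will vary with hardware and document size. A rough way to reproduce this kind of timing with the standard timeit module is sketched below; the local file name sitemap_news_index1.xml is a placeholder for a saved copy of the feed:

import timeit

setup = """
from lxml import etree
root = etree.parse("sitemap_news_index1.xml").getroot()   # placeholder: a saved copy of the feed
ns = {"news": "http://www.google.com/schemas/sitemap-news/0.9",
      "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"}
"""

statements = [
    'list(root.iterfind("sitemap:url", namespaces=ns))',   # consume the iterator for a fair comparison
    'root.findall("sitemap:url", namespaces=ns)',
    'root.xpath("sitemap:url", namespaces=ns)',
]

for stmt in statements:
    seconds = timeit.timeit(stmt, setup=setup, number=1000)
    print(f"{stmt}: {seconds / 1000 * 1e6:.1f} µs per call")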
