Parsing an XML file with lxml XPath

Question

I am using lxml XPath to parse the following XML file:

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>
    https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
        <image:image>
            <image:loc>
    https://www.reuters.com/resources/r/?m=02&d=20181126&t=2&i=1328589868&w=&fh=&fw=&ll=460&pl=300&r=LYNXNPEEAO0WM
    </image:loc>
        </image:image>
        <news:news>
            <news:publication>
                <news:name>Reuters</news:name>
                <news:language>eng</news:language>
            </news:publication>
            <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
            <news:title>
    Campbell Soup nears deal with Third Point to end board challenge: sources
    </news:title>
            <news:keywords>Headlines,Business, Industry</news:keywords>
            <news:stock_tickers>NYSE:CPB</news:stock_tickers>
        </news:news>
    </url>
</urlset>

Python code example:

import lxml.etree
import lxml.html
import requests

def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")

    namespace = "http://www.google.com/schemas/sitemap-news/0.9"
    root = lxml.etree.fromstring(r.content)


    records = root.xpath('//news:title', namespaces = {"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)


    records = root.xpath('//sitemap:loc', namespaces = {"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()

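Currently I use XPath to get all the URLs and titles, but that is not what I want, because I don't know which URL belongs to which title. My question is how to get each <url>, and then loop over each <url> as an item to get the corresponding <loc>, <news:keywords>, and so on. Thanks!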

Edit: expected output

foreach <url>
      get <loc>
      get <news:publication_date>
      get <news:title>

Answer 1

Use a relative XPath to get from each title to its associated URL:

import lxml.etree
import requests

# Prefix-to-URI map; "sitemap" stands in for the document's default namespace
ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)

for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)

    # Climb from the title to its enclosing <url>, then down to its <loc>
    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)
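
To try the ancestor axis without a network request, here is a minimal offline sketch. The two-entry sample document, the example.com URLs, and the helper name title_url_pairs are made up for illustration; only the XPath expressions come from the code above.

```python
from lxml import etree

# Hypothetical, trimmed-down sample in the same shape as the sitemap in the question
SAMPLE = b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <news:news><news:title>Title A</news:title></news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
    <news:news><news:title>Title B</news:title></news:news>
  </url>
</urlset>"""

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}


def title_url_pairs(xml_bytes):
    """Return (title, url) pairs, resolving each URL relative to its title."""
    root = etree.fromstring(xml_bytes)
    pairs = []
    for title in root.xpath("//news:title", namespaces=ns):
        # ancestor::sitemap:url climbs from the title to its enclosing <url>
        loc = title.xpath("ancestor::sitemap:url/sitemap:loc", namespaces=ns)
        pairs.append((title.text.strip(), loc[0].text.strip()))
    return pairs


print(title_url_pairs(SAMPLE))
# [('Title A', 'https://example.com/a'), ('Title B', 'https://example.com/b')]
```

The .strip() calls are an addition here, since the real feed wraps its text in whitespace and newlines.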

Exercise: rewrite this code to get the associated title starting from the URL.

Note: the titles (and possibly the URLs) appear to be HTML-escaped. Use the unescape() function

from html import unescape

to unescape them.
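
As a small illustration of what unescape() does, consider the sketch below. The sample title and the clean_text helper are hypothetical; the None check is there on the assumption that the text may come from a lookup that found nothing.

```python
from html import unescape


def clean_text(text):
    """Unescape HTML entities and trim whitespace; pass None through unchanged."""
    if text is None:
        return None
    return unescape(text).strip()


raw_title = "\n    Barnes &amp; Noble explores sale: sources\n    "  # made-up value
print(clean_text(raw_title))
# Barnes & Noble explores sale: sources
```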

Answer 2

The answer is:

from datetime import datetime
from html import unescape
from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

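# iterfind yields each <url> child of the root one at a time
for url in root.iterfind("sitemap:url", namespaces=ns):
    # findtext returns the text of the first match (None if nothing matches)
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    # This format string assumes every timestamp carries a +00:00 offset
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)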
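
The loop above assumes that every <url> has a publication date and that the offset is always +00:00. If either assumption might not hold, a more defensive variant could look like the following sketch. The helper name parse_publication_date is hypothetical, and accepting a colon in the %z offset requires Python 3.7 or later.

```python
from datetime import datetime, timedelta, timezone


def parse_publication_date(text):
    """Parse a timestamp such as 2018-11-26T02:55:00+00:00.

    Returns None when the element was missing (findtext returned None).
    """
    if text is None:
        return None
    # %z accepts offsets written with a colon (e.g. +05:30) on Python 3.7+
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%S%z")


print(parse_publication_date("2018-11-26T02:55:00+00:00"))
# 2018-11-26 02:55:00+00:00
```

Unlike the original format string, this returns a timezone-aware datetime, which may or may not be what the surrounding code expects.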

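The rules of thumb are:

Try not to use xpath. Use find, findall, or iterfind instead. xpath is a more complex algorithm than find, findall, or iterfind, so it takes more time and resources.

Use iterfind instead of findall, because iterfind yields its results. That is, it returns one item at a time, so it uses less memory.

If you only need the text, use findtext.

A more general rule is to read the official documentation.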
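
To make the differences between these methods concrete, here is a short sketch of what each one returns. The sample document and the example.com URLs are made up for illustration.

```python
from lxml import etree

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

# Hypothetical two-entry sitemap, not the Reuters feed
root = etree.fromstring(b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>""")

urls_list = root.findall("sitemap:url", namespaces=ns)   # a list of all matches
urls_iter = root.iterfind("sitemap:url", namespaces=ns)  # an iterator, consumed lazily
first_url = root.find("sitemap:url", namespaces=ns)      # first match, or None

# findtext goes straight to the text of the first match
first_loc = first_url.findtext("sitemap:loc", namespaces=ns)
# ... and returns None when nothing matches
missing = first_url.findtext("news:news/news:keywords", namespaces=ns)

print(len(urls_list), first_loc, missing)
# 2 https://example.com/a None
```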

First, let's create three for-loop functions and compare them.

def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass

Function	Time
`root.iterfind`	70.5 µs ± 543 ns
`root.findall`	72.3 µs ± 839 ns
`root.xpath`	84.8 µs ± 567 ns

We can see that iterfind is the fastest, as expected.
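
The answer does not show how these timings were taken. One way to run a comparable measurement with the standard timeit module is sketched below. The synthetic 200-entry sitemap is an assumption, so the absolute numbers will differ from the table.

```python
import timeit

from lxml import etree

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ns = {"sitemap": SITEMAP_NS}

# Synthetic sitemap with 200 <url> entries (hypothetical data, not the Reuters feed)
entries = "".join(
    f"<url><loc>https://example.com/{i}</loc></url>" for i in range(200)
)
root = etree.fromstring(f'<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'.encode())


def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass


def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass


def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass


RUNS = 1000
# Total seconds for RUNS calls of each function
timings = {
    name: timeit.timeit(func, number=RUNS)
    for name, func in [("iterfind", for1), ("findall", for2), ("xpath", for3)]
}

for name, seconds in timings.items():
    print(f"{name}: {seconds / RUNS * 1e6:.1f} µs per loop")
```

The ranking can shift with document size and machine, so it is worth measuring on your own data before drawing conclusions.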

Next, let's check the statements inside the for loop.

Statement	Time
`url.xpath('string(news:news/news:title)', namespaces=ns)`	15.7 µs ± 112 ns
`url_item.xpath('news:news/news:title', namespaces=ns)[0].text`	14.4 µs ± 53.7 ns
`url_item.find('news:news/news:title', namespaces=ns).text`	3.74 µs ± 60 ns
`url_item.findtext('news:news/news:title', namespaces=ns)`	3.71 µs ± 40.3 ns

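From the table above, we can see that find/findtext is about 4 times faster than xpath. findtext is even slightly faster than find, although that gap is within the reported error margins.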
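
Speed aside, the four statements also behave differently when the element is missing. The sketch below runs them against a made-up sample in which the second <url> has no <news:news> block.

```python
from lxml import etree

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

# Hypothetical sample: the second <url> has no title at all
root = etree.fromstring(b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <news:news><news:title>Title A</news:title></news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
  </url>
</urlset>""")

with_title, without_title = root.findall("sitemap:url", namespaces=ns)
path = "news:news/news:title"

# When the element exists, all four statements give the same text
results = [
    with_title.xpath(f"string({path})", namespaces=ns),
    with_title.xpath(path, namespaces=ns)[0].text,
    with_title.find(path, namespaces=ns).text,
    with_title.findtext(path, namespaces=ns),
]
print(results)
# ['Title A', 'Title A', 'Title A', 'Title A']

# When it is missing, they diverge
missing_string = without_title.xpath(f"string({path})", namespaces=ns)  # empty string
missing_nodes = without_title.xpath(path, namespaces=ns)    # empty list, so [0] would raise IndexError
missing_elem = without_title.find(path, namespaces=ns)      # None, so .text would raise AttributeError
missing_text = without_title.findtext(path, namespaces=ns)  # None
```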

Compared with Tomalak's 8.33 ms ± 52.4 µs, this answer takes only 3.41 ms ± 53 µs.

Solution 1
1 Accepted 2018-11-27 04:17:02

Solution 2
0 2021-01-11 03:16:32