How to find href links with a specific file format using BeautifulSoup in Python

Question

I am trying to get the XML link from a website. An example page can be found at https://www.loc.gov/item/2015669100/

With the code below, it only finds the PDF file link, not the XML link.

productDivs = soup.findAll('div', attrs={'class' : 'views'})
for div in productDivs:
    xml = div.find('a')['href']
    if xml.endswith('xml'):
        print(xml)

How can I get the XML file link?

Answer 1

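div.find('a') returns only the first <a> inside each div, which on this page is the PDF link. To reach the second link, you can use the CSS selector nth-of-type(n):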

import requests
from bs4 import BeautifulSoup

URL = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

print('PDF:', soup.select_one('.views a:nth-of-type(1)')['href'])
print('XML:', soup.select_one('.views a:nth-of-type(2)')['href'])

Or use find_next_sibling():

...
productDivs = soup.findAll("div", attrs={"class": "views"})

for div in productDivs:
    pdf = div.find("a")
    xml = pdf.find_next_sibling("a")["href"]

    print("PDF:", pdf["href"])
    print("XML:", xml)

Output:

PDF: https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001_Carter_transcript/afc2010039_crhp0001_Carter_transcript.pdf
XML: https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001_Carter_transcript/afc2010039_crhp0001_Carter_transcript.xml
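Both approaches above depend on the XML link being the second <a> in the block. If the order might vary, one alternative is to match on the file extension with the CSS attribute selector [href$="..."]. The following is a minimal sketch, not part of the original answer: the HTML snippet, the example.org URLs, and the helper name links_with_extension are hypothetical, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a "views" block;
# the structure of the real page may differ.
HTML = """
<div class="views">
  <a href="https://example.org/files/transcript.pdf">PDF</a>
  <a href="https://example.org/files/transcript.xml">XML</a>
  <a href="https://example.org/files/notes.txt">Text</a>
</div>
"""


def links_with_extension(soup, extension, scope=".views"):
    """Return the href of every <a> under `scope` whose href ends with `extension`.

    Hypothetical helper for illustration only.
    """
    # [href$="..."] matches attribute values that end with the given string.
    selector = f'{scope} a[href$="{extension}"]'
    return [a["href"] for a in soup.select(selector)]


soup = BeautifulSoup(HTML, "html.parser")
xml_links = links_with_extension(soup, ".xml")
print("XML:", xml_links)
```

Note that this matches the literal end of the href, so a URL with a query string after the file name (for example file.xml?download=1) would not match and would need a different check.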
