How to find an href tag with a certain file format in BeautifulSoup in Python
I am trying to get the XML links from a website. A sample page can be found at https://www.loc.gov/item/2015669100/
With the code below, I can only find the PDF file link, not the XML link.
productDivs = soup.findAll('div', attrs={'class' : 'views'})
for div in productDivs:
    xml = div.find('a')['href']
    if xml.endswith('xml'):
        print(xml)
How can I get the XML file link?
You can use the CSS selector nth-of-type(n):
import requests
from bs4 import BeautifulSoup
URL = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print('PDF:', soup.select_one('.views a:nth-of-type(1)')['href'])
print('XML:', soup.select_one('.views a:nth-of-type(2)')['href'])
Or use find_next_sibling():
...
productDivs = soup.findAll("div", attrs={"class": "views"})
for div in productDivs:
    pdf = div.find("a")
    xml = pdf.find_next_sibling("a")["href"]
    print("PDF:", pdf["href"])
    print("XML:", xml)
Output:
PDF: https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001_Carter_transcript/afc2010039_crhp0001_Carter_transcript.pdf
XML: https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001_Carter_transcript/afc2010039_crhp0001_Carter_transcript.xml
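Both approaches above depend on the XML link being the second anchor. If the order of the links might vary, it may be safer to filter by file extension instead. The code in the question only sees the PDF because div.find('a') returns just the first anchor in each div. Below is a minimal sketch of the extension-based idea. The SAMPLE_HTML markup, the example.org URLs, and the links_with_extension helper are hypothetical stand-ins so the snippet runs without network access, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; the real page may differ.
SAMPLE_HTML = """
<div class="views">
  <a href="https://example.org/files/transcript.pdf">PDF</a>
  <a href="https://example.org/files/transcript.xml">XML</a>
</div>
"""


def links_with_extension(html, extension):
    """Return hrefs inside div.views whose URL ends with the given extension."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="views"):
        # find_all("a", href=True) visits every anchor that has an href,
        # not just the first one as find("a") does.
        for a in div.find_all("a", href=True):
            if a["href"].lower().endswith(extension):
                links.append(a["href"])
    return links


xml_links = links_with_extension(SAMPLE_HTML, ".xml")
print(xml_links)

# A similar result with a CSS attribute selector: href ending in ".xml".
# Note that this match is case-sensitive, unlike the helper above.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
css_links = [a["href"] for a in soup.select('.views a[href$=".xml"]')]
print(css_links)
```

To try this against the real page, pass requests.get(URL).content in place of SAMPLE_HTML.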