Why can't I download the PDF using this code?
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

url = 'https://www.go100.com.tw/exam_download_3.php'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
all_as = soup.find_all('a')
for index, a_tag in enumerate(all_as):
    if 'pdf' in a_tag['href']:
        # print(a_tag['href'])
        urlretrieve(a_tag['href'], 'file_tmp.pdf')
        break
It raises a ValueError and I can't find the problem. Here is the result:
You have done 90% of the work. The hrefs on that page are relative, so you must resolve them with urljoin from urllib.parse:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.go100.com.tw/exam_download_3.php'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
all_as = soup.find_all('a')
for index, a_tag in enumerate(all_as):
    if 'pdf' in a_tag['href']:
        print(a_tag['href'])
        print(urljoin(url, a_tag['href']))
        response = requests.get(urljoin(url, a_tag['href']))
        with open("file_tmp.pdf", "wb") as f:
            f.write(response.content)
        break
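To see why urljoin fixes the ValueError: urlretrieve (and requests.get) need a full URL with a scheme, but the hrefs scraped from the page are relative paths. A minimal sketch of urljoin's behavior (the example hrefs are illustrative, not taken from the page):

```python
from urllib.parse import urljoin

base = 'https://www.go100.com.tw/exam_download_3.php'

# A relative href is resolved against the page URL:
print(urljoin(base, 'pdf/exam.pdf'))
# → https://www.go100.com.tw/pdf/exam.pdf

# An href that is already absolute is returned unchanged:
print(urljoin(base, 'https://example.com/a.pdf'))
# → https://example.com/a.pdf
```

So you can pass every scraped href through urljoin without first checking whether it is relative or absolute.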
You can also download all of these files using only requests:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.go100.com.tw/exam_download_3.php'
s = requests.Session()
correct_links = []
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Guard against anchors without an href (a.get('href') returns None for those).
links = [a.get('href') for a in soup.select('a') if a.get('href') and '.pdf' in a.get('href')]
for link in links:
    if 'https://' not in link:
        link = 'https://www.go100.com.tw' + link
    correct_links.append(link)
for link in list(set(correct_links)):
    r = s.get(link)
    with open(f"{re.sub(r'[^a-zA-Z0-9]', '', link)}.pdf", "wb") as f:
        f.write(r.content)
    print(f"saved {re.sub(r'[^a-zA-Z0-9]', '', link)}")
This saves all downloadable PDFs, under names derived from their URLs, in the same folder the script is run from. Requests documentation: https://requests.readthedocs.io/en/latest/
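If you prefer human-readable filenames over the regex-stripped ones, a small standard-library sketch is to keep the last path segment of each URL and decode any %-escapes (the sample URL below is hypothetical):

```python
from urllib.parse import urlsplit, unquote
import os

def filename_from_url(link, default='download.pdf'):
    """Return the last path segment of a URL, with %-escapes decoded."""
    name = os.path.basename(urlsplit(link).path)
    return unquote(name) or default

print(filename_from_url('https://www.go100.com.tw/pdf/exam%201.pdf'))
# → exam 1.pdf
```

This keeps the original filename the server used, instead of collapsing the whole URL into one alphanumeric string.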