Why can't I download the PDF using this code?
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

url = 'https://www.go100.com.tw/exam_download_3.php'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
all_as = soup.find_all('a')
for index, a_tag in enumerate(all_as):
    if 'pdf' in a_tag['href']:
        # print(a_tag['href'])
        urlretrieve(a_tag['href'], 'file_tmp.pdf')
        break
It raises a ValueError and I can't find the problem. Here is the result:
You have done 90% of the work. The hrefs on that page are relative, so you must resolve them with urljoin from urllib.parse:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.go100.com.tw/exam_download_3.php'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
all_as = soup.find_all('a')
for index, a_tag in enumerate(all_as):
    if 'pdf' in a_tag['href']:
        print(a_tag['href'])
        print(urljoin(url, a_tag['href']))
        response = requests.get(urljoin(url, a_tag['href']))
        with open("file_tmp.pdf", "wb") as f:
            f.write(response.content)
        break
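To see why urljoin fixes the ValueError: urlretrieve (and requests.get) need a full URL with a scheme, but the hrefs scraped from the page are relative paths. A minimal sketch of urljoin's behavior (the example hrefs are illustrative, not taken from the page):

```python
from urllib.parse import urljoin

base = 'https://www.go100.com.tw/exam_download_3.php'

# A relative href is resolved against the page URL:
print(urljoin(base, 'pdf/exam.pdf'))
# → https://www.go100.com.tw/pdf/exam.pdf

# An href that is already absolute is returned unchanged:
print(urljoin(base, 'https://example.com/a.pdf'))
# → https://example.com/a.pdf
```

So you can pass every scraped href through urljoin without first checking whether it is relative or absolute.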
You can also download all of these files using only requests:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.go100.com.tw/exam_download_3.php'
s = requests.Session()
correct_links = []
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Guard against anchors without an href (a.get('href') returns None for those).
links = [a.get('href') for a in soup.select('a') if a.get('href') and '.pdf' in a.get('href')]
for link in links:
    if 'https://' not in link:
        link = 'https://www.go100.com.tw' + link
    correct_links.append(link)
for link in list(set(correct_links)):
    r = s.get(link)
    with open(f"{re.sub(r'[^a-zA-Z0-9]', '', link)}.pdf", "wb") as f:
        f.write(r.content)
    print(f"saved {re.sub(r'[^a-zA-Z0-9]', '', link)}")
This saves all downloadable PDFs, under names derived from their URLs, in the same folder the script is run from. Requests documentation: https://requests.readthedocs.io/en/latest/
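If you prefer human-readable filenames over the regex-stripped ones, a small standard-library sketch is to keep the last path segment of each URL and decode any %-escapes (the sample URL below is hypothetical):

```python
from urllib.parse import urlsplit, unquote
import os

def filename_from_url(link, default='download.pdf'):
    """Return the last path segment of a URL, with %-escapes decoded."""
    name = os.path.basename(urlsplit(link).path)
    return unquote(name) or default

print(filename_from_url('https://www.go100.com.tw/pdf/exam%201.pdf'))
# → exam 1.pdf
```

This keeps the original filename the server used, instead of collapsing the whole URL into one alphanumeric string.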