
How to download all pdf files from multiple urls python

Using Python, I want to download all pdf files from a website (except those whose names begin with "INS"):

url_asn="https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&search_content_type=&search_text={}&sort_type=date&page={}"

If link['href'] is not a pdf, open it and download the pdf file if one exists - for each page, iterating through to the last page.

Maybe this will work? I have added a comment for each line.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = " " # url to scrape

folder_location = r'/webscraping' # folder location
# if there is no such folder, create it automatically
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url) # get the html
soup = BeautifulSoup(response.text, "html.parser") # parse the html
for link in soup.select("a[href$='.pdf']"): # select all the pdf links
    # name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    # open the file and write the pdf
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
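The snippet above does not yet skip files starting with "INS" or walk through multiple pages, which the question also asks for. Here is a sketch of one way to add both, not tested against the live site: the `is_wanted` filter and the `max_pages` / empty-page stopping rule are assumptions, and the `page` placeholder is taken from the `url_asn` template in the question.

```python
import os
from urllib.parse import urljoin

def is_wanted(href):
    """Keep only .pdf links whose file name does not start with 'INS'."""
    name = href.split('/')[-1]
    return name.lower().endswith('.pdf') and not name.startswith('INS')

def download_all(page_url_template, folder, max_pages=50):
    # third-party imports kept local so the pure filter above has no dependencies
    import requests
    from bs4 import BeautifulSoup

    os.makedirs(folder, exist_ok=True)
    for page in range(max_pages):  # assumption: pages are numbered from 0
        page_url = page_url_template.format(page)
        soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
        links = [a['href'] for a in soup.select("a[href$='.pdf']")]
        if not links:  # assumption: a page with no pdf links means we ran past the end
            break
        for href in filter(is_wanted, links):
            target = os.path.join(folder, href.split('/')[-1])
            with open(target, 'wb') as f:
                f.write(requests.get(urljoin(page_url, href)).content)
```

You would call it with a template whose remaining placeholders are already filled in, e.g. `download_all(url_asn.format(2010, 2024, "term", "{}"), '/webscraping')`, leaving only the page number to substitute per iteration.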

