Extract all PDF files from a website with Python [closed]

Using Python, I want to find the last results page for each query. How can I do that with my code?

from bs4 import BeautifulSoup 
import requests, lxml, urllib.request
from random import randint
from time import sleep
from urllib.parse import urljoin
import re
import os
from datetime import date


folder_location = r'./ASN Documents'
os.makedirs(folder_location, exist_ok=True)

I want to extract the PDFs for these queries:

query_list = ["maintenance", "fukushima"]

url_asn="https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&&search_text={}&sort_type=date&page={}"

for query in query_list:
    # last_page is still undefined here: finding it for each query is the question.
    for page in range(1, last_page + 1):
        format_url = url_asn.format(date.today().year - 10, date.today().year, query, page)
        url = format_url.replace(" ", "%20")
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        for link in soup.find_all("a", class_="Teaser-titleLink"):
            # Follow only result links that are HTML pages, not direct PDFs.
            if not link['href'].endswith('.pdf'):
                jk = urljoin("https://www.asn.fr/", link['href'])
                reqHTML = requests.get(jk)
                soupHTML = BeautifulSoup(reqHTML.content, 'html.parser')
                for pdf in soupHTML.select("a[href$='.pdf']"):
                    # Resolve the PDF link itself (not the result link) and name the file after it.
                    pdf_url = urljoin(jk, pdf['href'])
                    filename = os.path.join(folder_location, os.path.basename(pdf_url).replace("%20", " "))
                    with open(filename, 'wb') as f:
                        f.write(requests.get(pdf_url).content)
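The code above escapes spaces in the query by hand with .replace(" ", "%20"), which leaves other characters such as "&" or accented letters unescaped. A minimal sketch of an alternative, using only the standard library, is shown below. The helper names build_search_url and pdf_filename are hypothetical, as is the example PDF URL in the usage note; the URL template is the one from the question.

```python
import os
from urllib.parse import quote, unquote, urlparse

# Search URL template taken from the question.
SEARCH_URL = ("https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}"
              "&limit=50&&search_text={}&sort_type=date&page={}")


def build_search_url(query, page, year_from, year_to):
    """Fill the template, percent-encoding the query text (hypothetical helper)."""
    return SEARCH_URL.format(year_from, year_to, quote(query, safe=""), page)


def pdf_filename(pdf_url):
    """Derive a local file name from the last path segment of a PDF URL (hypothetical helper)."""
    path = urlparse(pdf_url).path  # drops any ?query or #fragment part
    return unquote(os.path.basename(path))
```

For example, build_search_url("a&b", 1, 2012, 2022) encodes the query as a%26b, and pdf_filename("https://www.asn.fr/docs/Rapport%20annuel.pdf?x=1") gives "Rapport annuel.pdf" (that URL is made up for illustration).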

To find the last page, you can use find_all to collect the pagination links and then take the second-to-last one:

import requests
from bs4 import BeautifulSoup

url = 'https://www.asn.fr/recherche?filter_year[from]=2012&filter_year[to]=2022&limit=50&&search_text=maintenance&sort_type=date&page=1'
rsp = requests.get(url)
soup = BeautifulSoup(rsp.content, 'html.parser')

last_page = int(soup.find_all("a", {"class": "page-link"})[-2].text)
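Indexing with [-2] depends on the exact layout of the pagination bar and raises an error when a query has no pagination links at all. A slightly more defensive sketch is to take the largest numeric label among the page-link anchors and fall back to 1. The function name find_last_page and the sample markup are assumptions made for this example; the real page may be structured differently.

```python
from bs4 import BeautifulSoup


def find_last_page(html):
    """Return the highest page number among the pagination links, or 1 if there are none."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = [
        int(a.get_text(strip=True))
        for a in soup.find_all("a", class_="page-link")
        if a.get_text(strip=True).isdecimal()  # skip "Next", arrows, etc.
    ]
    return max(numbers, default=1)


# Hypothetical pagination markup, made up for this example.
sample_html = """
<ul class="pagination">
  <li><a class="page-link" href="?page=1">1</a></li>
  <li><a class="page-link" href="?page=2">2</a></li>
  <li><a class="page-link" href="?page=7">7</a></li>
  <li><a class="page-link" href="?page=2">Next</a></li>
</ul>
"""

last_page = find_last_page(sample_html)
pages = list(range(1, last_page + 1))  # +1 because range() excludes its upper bound
```

In the download loop, this would be called once per query on the HTML of page 1, before iterating over the pages. Because range() excludes its upper bound, the loop needs last_page + 1 to reach the final page.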

