
Extract all pdf files from website python [closed]
Using Python, I want to get the last page of results for each query. How can I do that with my code?
from bs4 import BeautifulSoup
import requests, lxml, urllib.request
from random import randint
from time import sleep
from urllib.parse import urljoin
import re
import os
from datetime import date
folder_location = r'./ASN Documents'
if not os.path.exists(folder_location):os.mkdir(folder_location)
I want to extract the PDFs for these queries: query_list=["maintenance","fukushima"]
url_asn="https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&&search_text={}&sort_type=date&page={}"
while True:
    for query in query_list:
        for page in range(1, last_page, 1):  # last_page is not defined yet -- this is the value the question asks how to obtain
            format_url = url_asn.format(date.today().year - 10, date.today().year, query, page)
            url = format_url.replace(" ", "%20")
            req = requests.get(url)
            soup = BeautifulSoup(req.content, 'html.parser')
            for link in soup.findAll("a", class_="Teaser-titleLink"):
                if link['href'] != '.pdf':
                    jk = "https://www.asn.fr/" + link['href']
                    reqHTML = requests.get(jk)
                    soupHTML = BeautifulSoup(reqHTML.content, 'html.parser')
                    for pdf in soupHTML.select("a[href$='.pdf']"):
                        filename = os.path.join(folder_location, link.getText()
                                                .rstrip()
                                                .replace("%20", " "))
                        with open(f"{filename}.pdf", 'wb') as f:
                            # download the PDF link found on the detail page
                            f.write(requests.get(urljoin(jk, pdf['href'])).content)
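As an aside, the randint and sleep imports at the top of the snippet are never called; if the intent was to pace the requests between pages, a minimal sketch of how they could be used (the 1-3 second range is an arbitrary assumption, not something from the original code):

import requests
from random import randint
from time import sleep

def polite_get(url):
    # Fetch the URL, then pause for a random 1-3 seconds before the next request
    resp = requests.get(url)
    sleep(randint(1, 3))
    return resp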
To find the last page, you can use find_all and then access the second-to-last index:
import requests
from bs4 import BeautifulSoup
url = 'https://www.asn.fr/recherche?filter_year[from]=2012&filter_year[to]=2022&limit=50&&search_text=maintenance&sort_type=date&page=1'
rsp = requests.get(url)
soup = BeautifulSoup(rsp.content, 'html.parser')
last_page = int(soup.find_all("a", {"class": "page-link"})[-2].text)  # second-to-last pagination link holds the last page number
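Putting the two pieces together, here is a sketch of how that lookup could be wired into the original loop: request page 1 of each query first, read the last page number from the pagination, then iterate up to it. This assumes every results page exposes the same "page-link" pagination; the fallback to a single page when no pagination links are found is an added guard, not part of the original code.

import os
import requests
from bs4 import BeautifulSoup
from datetime import date
from urllib.parse import urljoin

folder_location = r'./ASN Documents'
os.makedirs(folder_location, exist_ok=True)

query_list = ["maintenance", "fukushima"]
url_asn = "https://www.asn.fr/recherche?filter_year[from]={}&filter_year[to]={}&limit=50&&search_text={}&sort_type=date&page={}"

for query in query_list:
    # Fetch page 1 only to read the pagination and learn the last page number
    first_url = url_asn.format(date.today().year - 10, date.today().year, query, 1).replace(" ", "%20")
    first_soup = BeautifulSoup(requests.get(first_url).content, 'html.parser')
    page_links = first_soup.find_all("a", {"class": "page-link"})
    last_page = int(page_links[-2].text) if len(page_links) >= 2 else 1

    for page in range(1, last_page + 1):
        url = url_asn.format(date.today().year - 10, date.today().year, query, page).replace(" ", "%20")
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        for link in soup.findAll("a", class_="Teaser-titleLink"):
            # Each teaser links to a detail page that holds the actual PDF links
            detail_url = urljoin("https://www.asn.fr/", link['href'])
            detail_soup = BeautifulSoup(requests.get(detail_url).content, 'html.parser')
            for pdf in detail_soup.select("a[href$='.pdf']"):
                filename = os.path.join(folder_location, link.getText().rstrip().replace("%20", " "))
                with open(f"{filename}.pdf", 'wb') as f:
                    f.write(requests.get(urljoin(detail_url, pdf['href'])).content)

Fetching page 1 twice (once for the page count, once again inside the loop) keeps the change minimal; starting the inner loop at page 2 and reusing the first response would avoid the duplicate request.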