![](/img/trans.png)
[英]I'm not able to get this code to print what's needed. I think my functions are coded incorrectly
[英]I'm not able to split my code into functions
我編寫了一個從網站下載 pdf 的代碼,它運行良好,下載了所有的 PDF(下面的第一個代碼)。 但是,當我將代碼拆分為函數時,只有兩個鏈接插入到“論文”列表中,並且執行以代碼零結束,但出現以下警告消息:
GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 11 of the file C:\Downloads\EditoraCL\download_pdf.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
第一個代碼:
import requests
import httplib2
import os
from bs4 import BeautifulSoup, SoupStrainer
papers = []
pdfs = []
http = httplib2.Http()
status, response = http.request('https://www.snh2021.anpuh.org/site/anais')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
papers.append(link['href'])
print(papers)
for x in papers:
if x.endswith('pdf'):
pdfs.append(x)
print(pdfs)
def baixa_arquivo(url, endereco):
resposta = requests.get(url)
if resposta.status_code == requests.codes.OK:
with open(endereco, 'wb') as novo_arquivo:
novo_arquivo.write(resposta.content)
print('Download concluído. Salvo em {}'.format(endereco))
else:
resposta.raise_for_status()
if __name__ == '__main__':
url_basica = 'https://www.snh2021.anpuh.org/{}'
output = 'Download'
for i in range(1, len(pdfs)):
nome_do_arquivo = os.path.join(output, 'artigo{}.pdf'.format(i))
a = pdfs[i]
z = url_basica.format(a)
y = requests.get(z)
if y.status_code!=404:
baixa_arquivo(z, nome_do_arquivo)
代碼分為功能:
import requests
import httplib2
import os
from bs4 import BeautifulSoup, SoupStrainer
papers = []
pdfs = []
def busca_links():
http = httplib2.Http()
status, response = http.request('https://www.snh2021.anpuh.org/site/anais')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
papers.append(link['href'])
return papers
def links_pdf():
for x in papers:
if x.endswith('pdf'):
pdfs.append(x)
return pdfs
def baixa_arquivo(url, endereco):
resposta = requests.get(url)
if resposta.status_code == requests.codes.OK:
with open(endereco, 'wb') as novo_arquivo:
novo_arquivo.write(resposta.content)
return f'Download concluído. Salvo em {endereco}'
else:
resposta.raise_for_status()
if __name__ == '__main__':
busca_links()
links_pdf()
url_basica = 'https://www.snh2021.anpuh.org/{}'
output = 'Download'
print(papers)
print(pdfs)
for i in range(1, len(pdfs)):
nome_do_arquivo = os.path.join(output, 'artigo{}.pdf'.format(i))
a = pdfs[i]
z = url_basica.format(a)
y = requests.get(z)
if y.status_code!=404:
baixa_arquivo(z, nome_do_arquivo)
有人能幫我理解為什么第二個代碼會出現這個錯誤嗎?
函數不共享它們的內部變量,因此為了使您的代碼工作,您應該將“論文”分配給函數本身,在函數內部返回它之后( papers = busca_links() 和 links_pdf( papers ))。
無論如何,為了組織和更清晰的代碼,您應該使用類和方法:
import os
import requests
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
class Pdf:
def __init__(self, base_url, url):
self.main_dir = os.path.dirname(__file__)
self.pdfs_dir = os.path.join(self.main_dir, 'pdfs')
self.base_url = base_url
self.url = url
def get_links(self):
http = httplib2.Http()
status, response = http.request(self.url)
self.links = []
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if link['href'].endswith('pdf'):
self.links.append(f"{self.base_url}{link['href']}")
def download_pdf(self):
for link in self.links:
response = requests.get(link, stream=True)
if response.status_code == 200:
file_path = os.path.join(self.pdfs_dir, link.split('/')[-1])
with open(file_path, 'wb') as f:
f.write(response.content)
print('Success. Saved on {}'.format(file_path))
else:
# Should handle errors here, by appending them to a list and
# trying again later.
print('Error.')
if __name__ == '__main__':
base_url = 'https://www.snh2021.anpuh.org/'
url = f'{base_url}site/anais'
pdf = Pdf(base_url, url)
pdf.get_links()
pdf.download_pdf()
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.