I'm not able to split my code into functions

I wrote a script that downloads PDFs from a website, and it works fine: it downloads all of the PDFs (first code below). However, when I split the code into functions, only two links get inserted into the papers list and execution finishes with exit code zero, but with the following warning message:

GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 11 of the file C:\Downloads\EditoraCL\download_pdf.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
   for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
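
(For reference, the warning only concerns the parser choice and disappears once the parser is named explicitly, as the message itself suggests; a minimal sketch, using the same URL as the code below:)

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.snh2021.anpuh.org/site/anais')
# Naming the parser explicitly silences the GuessedAtParserWarning.
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))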

First code:

import requests
import httplib2
import os
from bs4 import BeautifulSoup, SoupStrainer

papers = []
pdfs = []
http = httplib2.Http()
status, response = http.request('https://www.snh2021.anpuh.org/site/anais')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        papers.append(link['href'])
        print(papers)

for x in papers:
    if x.endswith('pdf'):
        pdfs.append(x)
    print(pdfs)


def baixa_arquivo(url, endereco):
    resposta = requests.get(url)
    if resposta.status_code == requests.codes.OK:
        with open(endereco, 'wb') as novo_arquivo:
            novo_arquivo.write(resposta.content)
            print('Download concluído. Salvo em {}'.format(endereco))
    else:
        resposta.raise_for_status()


if __name__ == '__main__':
    url_basica = 'https://www.snh2021.anpuh.org/{}'
    output = 'Download'
    for i in range(1, len(pdfs)):
        nome_do_arquivo = os.path.join(output, 'artigo{}.pdf'.format(i))
        a = pdfs[i]
        z = url_basica.format(a)
        y = requests.get(z)
        if y.status_code!=404:
            baixa_arquivo(z, nome_do_arquivo)

Code split into functions:

import requests
import httplib2
import os
from bs4 import BeautifulSoup, SoupStrainer
papers = []
pdfs = []
def busca_links():

    http = httplib2.Http()
    status, response = http.request('https://www.snh2021.anpuh.org/site/anais')
    for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            papers.append(link['href'])
            return papers


def links_pdf():
    for x in papers:
        if x.endswith('pdf'):
            pdfs.append(x)
            return pdfs


def baixa_arquivo(url, endereco):
    resposta = requests.get(url)
    if resposta.status_code == requests.codes.OK:
        with open(endereco, 'wb') as novo_arquivo:
            novo_arquivo.write(resposta.content)
            return f'Download concluído. Salvo em {endereco}'
    else:
        resposta.raise_for_status()


if __name__ == '__main__':
    busca_links()
    links_pdf()
    url_basica = 'https://www.snh2021.anpuh.org/{}'
    output = 'Download'
    print(papers)
    print(pdfs)
    for i in range(1, len(pdfs)):
        nome_do_arquivo = os.path.join(output, 'artigo{}.pdf'.format(i))
        a = pdfs[i]
        z = url_basica.format(a)
        y = requests.get(z)
        if y.status_code!=404:
            baixa_arquivo(z, nome_do_arquivo)

Can someone help me understand why the second version of the code produces this error?

Functions do not share their internal variables, so for your code to work you should assign the value returned by the function back to papers at the call site (papers = busca_links() and links_pdf(papers)). Note also that return papers in busca_links (and return pdfs in links_pdf) sits inside the for loop, so the function returns as soon as the first matching link is appended; the return has to come after the loop.
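
A minimal sketch of how the function-based version could look once the return statements are moved after the loops and the lists are passed around explicitly (same URLs, function names, and download loop as in the question):

import os
import requests
import httplib2
from bs4 import BeautifulSoup, SoupStrainer


def busca_links(url):
    # Collect every href on the page; return only after the whole loop is done.
    http = httplib2.Http()
    status, response = http.request(url)
    papers = []
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            papers.append(link['href'])
    return papers


def links_pdf(papers):
    # Keep only the links that point to PDF files.
    return [x for x in papers if x.endswith('pdf')]


def baixa_arquivo(url, endereco):
    resposta = requests.get(url)
    if resposta.status_code == requests.codes.OK:
        with open(endereco, 'wb') as novo_arquivo:
            novo_arquivo.write(resposta.content)
        print('Download concluído. Salvo em {}'.format(endereco))
    else:
        resposta.raise_for_status()


if __name__ == '__main__':
    url_basica = 'https://www.snh2021.anpuh.org/{}'
    output = 'Download'  # assumes this folder already exists, as in the original script
    papers = busca_links('https://www.snh2021.anpuh.org/site/anais')
    pdfs = links_pdf(papers)
    for i in range(1, len(pdfs)):
        nome_do_arquivo = os.path.join(output, 'artigo{}.pdf'.format(i))
        z = url_basica.format(pdfs[i])
        y = requests.get(z)
        if y.status_code != 404:
            baixa_arquivo(z, nome_do_arquivo)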

In any case, for better organization and cleaner code, you should use classes and methods:

import os
import requests
import httplib2
from bs4 import BeautifulSoup, SoupStrainer


class Pdf:

    def __init__(self, base_url, url):
        self.main_dir = os.path.dirname(__file__)
        self.pdfs_dir = os.path.join(self.main_dir, 'pdfs')
        # Create the output folder up front so the writes in download_pdf don't fail.
        os.makedirs(self.pdfs_dir, exist_ok=True)
        self.base_url = base_url
        self.url = url
        
    def get_links(self):
        http = httplib2.Http()
        status, response = http.request(self.url)
        self.links = []
        # Passing 'html.parser' explicitly avoids the GuessedAtParserWarning.
        for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                if link['href'].endswith('pdf'):
                    self.links.append(f"{self.base_url}{link['href']}")

    def download_pdf(self):
        for link in self.links:
            response = requests.get(link, stream=True)
            if response.status_code == 200:
                file_path = os.path.join(self.pdfs_dir, link.split('/')[-1])
                with open(file_path, 'wb') as f:
                    f.write(response.content)
                print('Success. Saved on {}'.format(file_path))
            else:
                # Should handle errors here, by appending them to a list and
                # trying again later.
                print('Error.')


if __name__ == '__main__':
    base_url = 'https://www.snh2021.anpuh.org/'
    url = f'{base_url}site/anais'
    pdf = Pdf(base_url, url)
    pdf.get_links()
    pdf.download_pdf()
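
With this structure the link list lives on the instance (self.links) rather than in module-level variables, which is exactly what avoids the sharing problem from the split-into-functions version; just note that get_links() has to be called before download_pdf(), since the latter iterates over self.links.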
