
How to download specific files from a website using Python + Selenium

I want to download some specific files from this page.

These exact four files:

(image showing the four files)

How am I supposed to iterate through the page using Selenium in a way that maintains good programming practices?

Is there a library better than Selenium for this?

I really just need some clarifying ideas.

Selenium is not lightweight; it is a last resort. It mimics a real browser, so it is meant for things like event handling (clicking an element, captcha submission, etc.). Also, if you're trying to scrape a page that uses JavaScript (i.e., dynamically generated data that cannot be found when you check the source code of the webpage), Selenium can be a good choice.
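For completeness, if you did end up needing Selenium (for example, if the links were rendered by JavaScript), a minimal sketch could look like the following. It assumes Selenium 4 with a working Chrome/chromedriver setup; the CSS selector is just an illustration based on the page structure shown later in this answer.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude')

# collect the target of every "callout" link inside the content div
links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'div#parent-fieldname-text p.callout a')]
print(links)
driver.quit()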

For any web scraping project, first search for your desired text in the source code of the web page (press Ctrl+U when you visit the page). If the desired elements (text, links, etc.) can be found in the source code, then you don't need a heavyweight library like Selenium.
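As a quick sanity check, you can also confirm programmatically that the links are present in the raw HTML. This small snippet reuses the User-Agent idea from the full example below, since some sites refuse requests without one:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude', headers=headers).text
print('Anexo I' in html)  # True means the links are in the static HTML, so no browser automation is needed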

In this case, the text you're trying to parse can be found in the source code.

So you can use the requests library and a simple parser library like bs4 (BeautifulSoup).

For the selectors, check this page - W3Schools CSS Selectors.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = 'https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

paragraphs = soup.select('div#parent-fieldname-text > h3 ~ p')  # select every <p> that follows an <h3> inside the div with id 'parent-fieldname-text'

Output -

[<p class="callout"><a class="alert-link external-link" href="http://www.ans.gov.br/component/legislacao/?view=legislacao&amp;task=TextoLei&amp;format=raw&amp;id=NDAzMw==" target="_blank">Resolução Normativa n°465/2021</a></p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo I - Lista completa de procedimentos (.pdf)</a></p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.xlsx" target="_self" title="">Anexo I - Lista completa de procedimentos (.xlsx)</a></p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_II_DUT_2021_RN_465.2021_tea.br_RN473_RN477_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo II - Diretrizes de utilização (.pdf)</a></p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_III_DC_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo III - Diretrizes clínicas (.pdf)</a></p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_IV_PROUT_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo IV - Protocolo de utilização (.pdf)</a></p>,
 <p class="callout"><a class="alert-link internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/nota13_geas_ggras_dipro_17012013.pdf" target="_blank" title="">Nota sobre as Terminologias</a><br/> Rol de Procedimentos e Eventos em Saúde, Terminologia Unificada da Saúde Suplementar - TUSS e Classificação Brasileira Hierarquizada de Procedimentos Médicos - CBHPM</p>,
 <p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/CorrelacaoTUSS.2021Rol.2021_RN478_RN480_RN513_FU_RN536_20220506.xlsx" target="_self" title="">Correlação TUSS X Rol<br/></a> Correlação entre o Rol de Procedimentos e Eventos em Saúde e a Terminologia Unificada da Saúde Suplementar – TUSS</p>]

You're looking for the first few elements from this output list.
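If you only need the four files from the question rather than every callout link, you can filter by the link text first. This is only a sketch - it assumes the files you want are the ones whose link text starts with "Anexo", so adjust the condition (or simply slice the list) to match your exact four files:

# keep only the links whose visible text starts with 'Anexo' (assumption - adjust as needed)
wanted = [p.a for p in paragraphs if p.a and p.a.get_text(strip=True).startswith('Anexo')]
for a in wanted:
    print(a.get_text(strip=True), '->', a['href'])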

To download all of the linked files, use:

for p in paragraphs:
    file_url = p.a['href']                    # the link target, not the visible text
    file_name = file_url.rsplit('/', 1)[-1]   # use the last URL segment as the local file name
    r = requests.get(file_url, headers=headers, allow_redirects=True)
    with open(file_name, 'wb') as f:
        f.write(r.content)
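Note that the URL to download comes from the link's href attribute, not from get_text() (which only returns the visible label), and the local file name is taken from the last segment of the URL so the files keep their original names. For very large files you could pass stream=True to requests.get and write the response in chunks with iter_content, but for these PDF/XLSX files reading r.content at once is fine.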
    
