
With Python and Selenium, how do I find the hidden links to files on a website?

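Using Python 3 and Selenium, I want to capture the PDF file links from a page. I could not find these links in Inspect Element; they seem to be generated dynamically.

On the site I located the exact spot: the "Documentos" link box, which contains a list of links (Certidão). Clicking one of them opens a new tab with a PDF (example).

I then wrote the script below, which looks up the element for the PDF link box by XPath and then calls a function to read the exact attribute of each link.

But it does not work. Does anyone know how I could fix this, or another way to do it?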

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select


site = "https://divulgacandcontas.tse.jus.br/divulga/#/candidato/2022/2040602022/AP/30001653385"


# Function to get the links with attribute
def find(elem):
    element = elem.get_attribute("dvg-link-doc dvg-certidao")
    if element:
        return element
    else:
        return False

driver = webdriver.Chrome(r'D:\Code\chromedriver.exe')  # raw string so the backslashes are not read as escapes
driver.get(site)


documents = []
# Look for the elements in the box where the PDFs are
elems = driver.find_elements("xpath", '/html/body/div[2]/div[1]/div/div[1]/section[3]/div/div[3]/div[2]/div/div/ul')


# Iterate over the elements found
for elem in elems:
    
              
    # Test if there is a link available
    try:
        links = WebDriverWait(elem, 2).until(find)
        print(links)
        
        if links.endswith(".pdf"):
            print(links)
            dicionario = {"link": links}
            documents.append(dicionario)
        
    except:
        continue

Here is one way to get the URLs of the PDF files under "Documentos" (the brown links):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t


chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = "https://divulgacandcontas.tse.jus.br/divulga/#/candidato/2022/2040602022/AP/30001653385"

browser.get(url)

# Wait until the page's JavaScript has rendered the document links
links = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".dvg-link-doc.dvg-certidao")))
for i in range(len(links)):
    current_link = links[i]
    print(current_link.text)
    t.sleep(1)
    current_link.click()  # opens the PDF in a new tab
    t.sleep(1)
    # Switch to the newest tab and read the PDF URL from it
    browser.switch_to.window(browser.window_handles[-1])
    print(browser.current_url)
    t.sleep(1)
    # Load the candidate page again in this tab and re-locate the links
    browser.get(url)
    links = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".dvg-link-doc.dvg-certidao")))
    t.sleep(1)

This prints the following in the terminal:

Certidão criminal da Justiça Federal de 2º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/12_1659631723977.pdf
Certidão criminal da Justiça Federal de 1º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/11_1659631722277.pdf
Certidão criminal da Justiça Estadual de 2º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/14_1659631720538.pdf
Certidão criminal da Justiça Estadual de 1º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/13_1659631719616.pdf

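You will need to adapt the code to your own Selenium setup: keep the imports, and the code that comes after the browser/driver is defined. Selenium documentation: https://www.selenium.dev/documentation/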
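
A possible variation on the loop above, shown only as a sketch: instead of reloading the page in each new tab, read the URL from the tab that the click opens, close that tab, and switch back to the original one. The helper name `url_opened_by_click` and its `timeout` and `poll` parameters are made up for this illustration. It relies only on the standard WebDriver members `current_window_handle`, `window_handles`, `switch_to.window()`, `current_url` and `close()`, and it assumes that the click really does open a new tab, as it does for the "Certidão" links described in the question.

```python
import time


def url_opened_by_click(driver, element, timeout=10.0, poll=0.2):
    """Click `element`, return the URL of the tab it opens, then go back.

    Hypothetical helper: `driver` is an existing Selenium WebDriver and
    `element` is a link that opens its target in a new tab.
    """
    original = driver.current_window_handle
    before = set(driver.window_handles)
    element.click()

    # Poll until a window handle appears that was not there before the click.
    deadline = time.monotonic() + timeout
    new_handles = []
    while time.monotonic() < deadline:
        new_handles = [h for h in driver.window_handles if h not in before]
        if new_handles:
            break
        time.sleep(poll)
    if not new_handles:
        raise TimeoutError("clicking the element did not open a new tab")

    driver.switch_to.window(new_handles[0])
    try:
        # A freshly opened tab may briefly report about:blank before it navigates.
        url = driver.current_url
        while url in ("", "about:blank") and time.monotonic() < deadline:
            time.sleep(poll)
            url = driver.current_url
        return url
    finally:
        # Always close the extra tab and return to the page with the links.
        driver.close()
        driver.switch_to.window(original)
```

With this helper, the loop body could reduce to `print(link.text, url_opened_by_click(browser, link))` for each element in `links`. Because the original tab is never reloaded, the element references should stay usable, although that has not been verified against this particular site.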

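I expect what you want is to first find all the links on the page (related). From there, I would read each link's href with element.get_attribute("href") and, if it ends in .pdf, assume it is a PDF.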
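
That suggestion could be sketched roughly as follows. The function names `is_pdf_url` and `collect_pdf_links` are invented for this example, and `driver` is assumed to be an already-created WebDriver that has finished loading the page. The locator is passed as the plain string `"css selector"`, in the same way the question's code passes `"xpath"`. Note that this only finds anchors that actually carry an `href`; the question reports that no such links are visible in Inspect Element on this page, so it may return nothing there.

```python
from urllib.parse import urlparse


def is_pdf_url(href):
    """Return True if the path part of the URL ends in .pdf (case-insensitive)."""
    if not href:
        return False
    return urlparse(href).path.lower().endswith(".pdf")


def collect_pdf_links(driver):
    """Return the href of every <a href=...> on the current page that looks like a PDF.

    Hypothetical helper: `driver` is an existing Selenium WebDriver.
    """
    anchors = driver.find_elements("css selector", "a[href]")
    hrefs = (a.get_attribute("href") for a in anchors)
    return [h for h in hrefs if is_pdf_url(h)]
```

Checking the parsed path rather than the raw string means a URL with a query string after the file name still counts as a PDF link.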

