Web scraping with Selenium in Python: filtering with the webdriver find functions

I'm filtering the fixed-income products that appear on this website: https://yubb.com.br/investimentos/renda-fixa?investment_type=cdb&months=3&principal=10000000.0&sort_by=minimum_investment

Basically, the page shows some cards, and I want to know how many cards appear on each page. For example, choosing cdb as the type and 3 months shows 16 cards, but with a different number of months or product type, fewer cards may appear.

So far, I know how to get the number of pages by looking at "investmentCardContainer__footer", which is a class. The number of cards, however, seems to be expressed as a style, and I don't know how to find it using the Selenium webdriver find functions.

Here's a hint of what I'm looking for:

https://imgur.com/a/8B5TrMe

The idea is to get this number of cards and use it in a loop to aggregate the cards' information in a vector.

    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    # Convert is the asker's own helper; its definition is not shown in the question.
    vetor = ["cdb", "lca", "lci"]
    dataset_boxes = []
    now = time.time()
    for i in vetor:
        options = Options()
        options.add_argument('--headless')
        # Adjacent string literals avoid the indentation whitespace that a
        # backslash continuation inside the string would put into the URL.
        url = ('https://yubb.com.br/investimentos/renda-fixa?investment_type={}&months=12'
               '&principal=1000000.0&sort_by=net_return').format(i)
        # Positional driver path as in the original (Selenium 3 style).
        driver = webdriver.Chrome("C:\\Users\\yourpath\\Desktop\\PYTHON\\chromedriver.exe", options=options)
        driver.get(url)
        time.sleep(1)
        num_pages = driver.find_element(By.CLASS_NAME, "investmentCardContainer__footer").text
        list_pages = Convert(num_pages)
        last_page = int(list_pages[len(list_pages) - 3])
        driver.quit()
        # The page loop sits at the same indentation level as the statements above.
        for j in range(1, last_page + 1):
            url2 = ('https://yubb.com.br/investimentos/renda-fixa?collection_page={}&investment_type={}&months=12'
                    '&principal=1000000.0&sort_by=net_return').format(j, i)
            driver = webdriver.Chrome("C:\\Users\\yourpath\\Desktop\\PYTHON\\chromedriver.exe", options=options)
            driver.get(url2)
            num_boxes = driver.find_element(By.CLASS_NAME, "investmentCardContainer__body").text
            list_boxes = Convert(num_boxes)
            dataset_boxes.append(list_boxes)
            driver.quit()
    print('idk')
    later = time.time()
    difference = int(later - now)
    print('Processo finalizado em {} segundos.'.format(difference))
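
The two URLs in the snippet above differ only in their query parameters, so they can also be assembled from a dict with the standard library's `urllib.parse.urlencode`. This is a minimal sketch, assuming the parameter names seen in the question's URLs; `build_url`, `BASE_URL`, and the default values are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://yubb.com.br/investimentos/renda-fixa"


def build_url(investment_type, page=None, months=12, principal=1000000.0,
              sort_by="net_return"):
    """Build a listing URL; `page` maps to the collection_page parameter."""
    params = {}
    if page is not None:
        params["collection_page"] = page
    params.update(
        investment_type=investment_type,
        months=months,
        principal=principal,
        sort_by=sort_by,
    )
    # urlencode joins the pairs with "&" and percent-escapes the values,
    # so no stray whitespace can end up inside the URL.
    return BASE_URL + "?" + urlencode(params)


print(build_url("cdb"))
print(build_url("lca", page=2))
```

Keeping the parameters in one place also makes it easier to vary `months` or `sort_by` without editing two format strings.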

Use WebDriverWait with the following XPath to get the number of pages.

    print(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '(//span[@class="page"]//a)[last()]'))).text)

You need the following imports to run the code above.

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

For this link: https://yubb.com.br/investimentos/renda-fixa?investment_type=cdb&months=3&principal=10000000.0&sort_by=minimum_investment

It should return: 8
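
Once the page count is known, the number of cards on a page may not need to be read from the markup in advance: `find_elements` returns a list of every match, and its length is the card count. The following is a rough sketch under assumptions. `scrape_cards` and `cards_per_page` are hypothetical helpers, and the locator for a single card is passed in by the caller because the question does not show the card's class name.

```python
def scrape_cards(driver, urls, by, value):
    """Visit each URL and return one list of card texts per page.

    `driver` is any Selenium WebDriver; `by` and `value` form the locator
    for a single card (an assumption: it must be read from the page's HTML).
    """
    pages = []
    for url in urls:
        driver.get(url)
        # find_elements returns an empty list when nothing matches,
        # so a page without cards yields a count of zero instead of an error.
        cards = driver.find_elements(by, value)
        pages.append([card.text for card in cards])
    return pages


def cards_per_page(pages):
    """Number of cards found on each page."""
    return [len(page) for page in pages]
```

With a real driver the call might look like `scrape_cards(driver, urls, By.CSS_SELECTOR, ".investmentCard")`, where `.investmentCard` is only a guess at the selector and should be checked against the page source. If the cards are rendered after the page loads, an explicit wait before `find_elements` may be needed.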
