繁体   English   中英

Selenium Python 的循环无法正常工作

[英]Loop is not working properly for Selenium Python

我正在尝试运行以下代码:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))

当我运行此代码时,新闻链接列表在执行结束时为空。 但是,如果我一行一行地运行它,通过我自己一个一个地分配“k”的值,它就会成功运行。

我在逻辑上哪里出错了。 请帮忙。

看起来冗余代码太多了。 我建议使用线性xpathcss selector来识别元素。 但是,有些页面没有出现新链接,您需要使用try..except来处理。

由于您需要导航每个 url 我建议使用显式等待WebDriverWait()

代码:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver=webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get("https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx")
allyears=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years a")))
yearsurl=[url.get_attribute("href") for url in allyears]

newslinks = list()
for yr in yearsurl:
  driver.get(yr)
  try:
    for element in WebDriverWait(driver,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"div.link-detailpage >a"))):
        newslinks.append(element.get_attribute("href"))
  except:
    continue
print(newslinks)

OutPut

['https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/a-problem-solved-at-a-rate-of-knots-the-latest-Top-Story-available-on-CNHIndustrial-com.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-acquires-a-minority-stake-in-Augmenta.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-presents-YOUNIVERSE.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/Calling-of-the-Annual-General-Meeting.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/march/Pages/CNH-Industrial-completes-minority-investment-in-Monarch-Tractor.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-N-V--announces-the-extension-by-one-additional-year-to-March-2026-of-its-syndicated-credit-facility.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Working-for-a-safer-future-with-World-Class-Manufacturing.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/Behind-the-Wheel-CNH-Industrial-supports-the-growing-hemp-industry-in-North-America.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/CNH-Industrial-employees-in-Italy-to-receive-contractual-bonus-for-2020-results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/February/Pages/2020-Fourth-Quarter-and-Full-Year-Results.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/The-Iveco-Defence-Vehicles-plant-in-Sete-Lagoas,-Brazil-and-the-New-Holland-Agriculture-facility-in-Croix,-France.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-to-announce-2020-Fourth-Quarter-and-Full-Year-financial-results-on-February-3-2021.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-publishes-its-2021-Corporate-Calendar.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/Iveco-Defence-Vehicles-supplies-third-generation-protected-military-GTF8x8-(ZLK-15t)-trucks-to-the-German-Army.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/STEYR-New-Holland-Agriculture-CASE-Construction-Equipment-and-FPT-Industrial-win-prestigious-2020-Good-Design%C2%AE-Awards.aspx', 'https://www.cnhindustrial.com/en-us/media/press_releases/2021/january/Pages/CNH-Industrial-completes-the-acquisition-of-four-divisions-of-CEG-in-South-Africa.aspx',so on...] 

更新

如果您不想使用最佳实践的 webdriverwait,请使用time.sleep()因为页面需要一些时间来加载,并且元素在交互之前应该是可见的。

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome("C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe")
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')
years_urls = list()
time.sleep(5)
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
years_elements = driver.find_elements_by_xpath('//div[@id="ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years"]//a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
print(years_urls)
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    time.sleep(3)    
    news = driver.find_elements_by_xpath('//div[@class="link-detailpage"]/a')
    for j in range(len(news)):
        newslinks.append(news[j].get_attribute('href'))
        
print(newslinks)

有一个弹出窗口要求您接受您需要事先单击的 cookies。

将此添加到您的脚本中:

WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()

所以最终的结果将是:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('C:/Users/SoumyaPandey/Desktop/Galytix/Scrapers/data_ingestion/chromedriver.exe')
driver.get('https://www.cnhindustrial.com/en-us/media/press_releases/Pages/default.aspx')

# this part is added, together with the necessary imports
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyButtonAccept")))
driver.find_element_by_id("CybotCookiebotDialogBodyButtonAccept").click()

years_urls = list()
#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years --> id for the year filter
# years_elements = driver.find_element_by_css_selector("#ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years")
years_elements = driver.find_element_by_id('ctl00_ctl33_g_8893c127_d0ad_40f2_9856_d85936172f35_years').find_elements_by_tag_name('a')
for i in range(len(years_elements)):
    years_urls.append(years_elements[i].get_attribute('href'))
newslinks = list()
for k in range(len(years_urls)):
    url = years_urls[k]
    driver.get(url)
    #link-detailpage --> id for the newslinks in each year
    news = driver.find_elements_by_class_name('link-detailpage')
    for j in range(len(news)):
        newslinks.append(news[j].find_element_by_tag_name('a').get_attribute('href'))

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM