
Stop page loading with Selenium WebDriver

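At the moment, my script checks multiple URLs to see whether about 5 different types of keywords are present in the web page. Depending on which keywords are found, it outputs "ok" or "no" for each type.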
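The check described above can be sketched in a few lines of Python. This is only an illustration: the names `classify` and `keyword_groups` are hypothetical, and skipping empty keywords (for example, blank CSV cells) is an assumption, since an empty string would otherwise match every page.

```python
def classify(page_text, keyword_groups):
    """Return one "ok"/"no" status per keyword group.

    A group is "ok" as soon as any of its non-empty keywords occurs in the
    page text (case-insensitive), otherwise "no".
    """
    text = page_text.lower()
    return [
        "ok" if any(kw and kw.lower() in text for kw in group) else "no"
        for group in keyword_groups
    ]


# Hypothetical sample data, not taken from the real CSV files.
html = "<html><body>Contact us | Privacy Policy</body></html>"
groups = [["contact", "kontakt"], ["imprint"], ["privacy"]]
print(classify(html, groups))  # ['ok', 'no', 'ok']
```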

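I use set_page_load_timeout(30) to avoid a URL loading forever.

Problem: some web pages do not finish loading before the timeout (even with a "very" long timeout). But I can see visually (not headless) that the page has loaded. The script could at least check the keywords in the page, but it does not: after the timeout it prints "Fail", and the row that should say "no" never reaches the final output.

So I don't want to end up in an except after 30 seconds. I want to stop loading the page after 30 seconds and work with whatever has been loaded.

My code: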

# coding=utf-8
import re

sites=[]

keywords_1=[]
keywords_2=[]
keywords_3=[]
keywords_4=[]
keywords_5=[]

import sys
from selenium import webdriver
import csv
import urllib.parse
from datetime import datetime
from datetime import date

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.chrome.options import Options


def reader3(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            sites.append(str(row[0]).lower())
try:
    reader3("data/script/filter_domain_OUTPUT.csv")
except Exception as e:
    print(e)
    sys.exit()
exc=[]
def reader3(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            exc.append(str(row[0]).lower())
try:
    reader3("data/script/checking_EXCLUDE.csv")
except Exception as e:
    print(e)
    sys.exit()
def reader2(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            keywords_1.append(str(row[0]).lower())
            keywords_2.append(str(row[1]).lower())
            keywords_3.append(str(row[2]).lower())
            keywords_4.append(str(row[3]).lower())
            keywords_5.append(str(row[4]).lower())

try:
    reader2("data/script/checking_KEYWORD.csv")
except Exception as e:
    print(e)
    sys.exit()


chrome_options = Options()
chrome_options.page_load_strategy = 'none'
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--lang=en')
chrome_options.add_argument('--disable-notifications')
#chrome_options.headless = True

chrome_options.add_argument('start-maximized')
chrome_options.add_argument('enable-automation')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-browser-side-navigation')
chrome_options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=chrome_options)
for site in sites:
    try:
        status_1 = "no"
        status_2 = "no"
        status_3 = "no"
        status_4 = "no"
        status_5 = "no"
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        today = date.today()
        print("[" + current_time + "] " + str(site))
        if 'http' in site:
            driver.get(site)
        else:
            driver.get("http://" + site)
        r=str(driver.page_source).lower()
        driver.set_page_load_timeout(30)
        for keyword_1 in keywords_1:
            if keyword_1 in r:
                status_1="ok"
                print("home -> " +str(keyword_1))
                break

        for keyword_2 in keywords_2:
            if keyword_2 in r:
                status_2="ok"
                print("home -> " +str(keyword_2))
                break

        for keyword_3 in keywords_3:
            if keyword_3 in r:
                status_3="ok"
                print("home -> " +str(keyword_3))
                break

        for keyword_4 in keywords_4:
            if keyword_4 in r:
                status_4="ok"
                print("home -> " +str(keyword_4))
                break
        for keyword_5 in keywords_5:
            if keyword_5 in r:
                status_5="ok"
                print("Home ->" +str(keyword_5))
                break 
        with open('data/script/checking_OUTPUT.csv', mode='a') as employee_file:
            employee_writer = csv.writer(employee_file, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL,lineterminator='\n')
            write=[site,status_1,status_2,status_3,status_4,status_5]
            employee_writer.writerow(write)
            
        
    except Exception as e:
        #driver.delete_all_cookies()
        print("Fail")
driver.quit()

https://www.selenium.dev/documentation/en/webdriver/page_loading_strategy/

    chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
    WebDriver driver = new ChromeDriver(chromeOptions);

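Use a page load strategy. With eager, WebDriver waits only until the initial HTML has loaded. You can also use none, but make sure you have explicit/implicit waits for elements in case timing issues arise.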
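With the none strategy, an explicit wait with a deadline could look like the sketch below. It assumes the driver was created with page_load_strategy = 'none' (as in the question's code) and uses only driver.get(), driver.execute_script() and driver.page_source. The helper name `get_with_deadline` and its parameters are hypothetical, and polling document.readyState is one possible readiness check, not the only one.

```python
import time


def get_with_deadline(driver, url, deadline_s=30.0, poll_s=0.5):
    """Start loading url, wait at most deadline_s seconds, return the HTML so far.

    Assumes driver was created with page_load_strategy = 'none', so that
    driver.get() returns without waiting for the page to finish loading.
    """
    driver.get(url)
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        # Sleep first: right after get() the browser may still report the
        # previous document, so an immediate check could pass too early.
        time.sleep(poll_s)
        if driver.execute_script("return document.readyState") == "complete":
            break
    else:
        # Deadline passed without a complete load: abandon pending requests.
        driver.execute_script("window.stop();")
    return driver.page_source
```

In the question's loop, the driver.get(...) call and the page_source read could then be replaced by something like r = get_with_deadline(driver, site).lower(), so that a slow page still produces a row in the output instead of ending up in the except branch.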

In Python it behaves oddly: only the desired-capabilities approach works.

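from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the defaults so the shared class-level dict is not mutated.
caps = DesiredCapabilities.CHROME.copy()
# caps["pageLoadStrategy"] = "normal"  # waits for the full page load
caps["pageLoadStrategy"] = "none"  # get() returns without waiting for the load

options = Options()

# Note: the desired_capabilities keyword is accepted by Selenium 3 and early
# Selenium 4; later Selenium 4 releases removed it.
driver = webdriver.Chrome(desired_capabilities=caps, options=options)

url = 'https://www.gm-trucks.com/'

driver.get(url)
print(driver.title)
print("hi")
input()  # keep the browser open until Enter is pressed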

Or:

options = Options()

options.set_capability("pageLoadStrategy", "none")
driver = webdriver.Chrome(options=options)

Update

The documentation has been updated to match Selenium 4.0.0-alpha-7.

So either use one of the solutions above, or update to Selenium v4 to be future-proof:

  pip install selenium==4.0.0.a7

Bug report

https://github.com/SeleniumHQ/seleniumhq.github.io/issues/627

First of all, set_page_load_timeout() and page_load_strategy = 'none' ideally should not be used together.

set_page_load_timeout()

set_page_load_timeout() sets the amount of time to wait for a page load to complete before an error is raised.

You can find a detailed discussion in How to set the timeout of 'driver.get' for python selenium 3.8.0?


page_load_strategy

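page_load_strategy = 'none' makes Selenium return as soon as the initial page content has been fully received (that is, the HTML content has been downloaded).

You can find a detailed discussion in How to set the timeout of 'driver.get' for python selenium 3.8.0?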
