
Web scraping table with selenium gets only HTML elements but no content

I am trying to scrape the tables from these 3 websites using selenium and BeautifulSoup:

https://www.erstebank.hr/hr/tecajna-lista

https://www.otpbanka.hr/tecajna-lista

https://www.sberbank.hr/tecajna-lista/

For all 3 websites the result is the HTML code of the table, but without the text content.

My code is below:

import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime

from selenium import webdriver

PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'

driver = webdriver.Chrome(PATH)

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

driver.implicitly_wait(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')

print(table)

driver.close()

Please help, what am I missing?

Thanks

BeautifulSoup won't find the table because, from its point of reference, the table doesn't exist yet. Here you tell Selenium to pause its own element matcher if it notices that an element is not there yet:

# This only works for the Selenium element matcher
driver.implicitly_wait(10)

Then, right after that, you grab the current HTML state of the page (where the table still doesn't exist) and feed it into BeautifulSoup's parser. BS4 will never see the table, even if it loads in later, because it only works with the snapshot of HTML you just handed it:

# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')

# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')

# BS4 finds no tables as, when the page first loads, there are none.

To fix this, ask Selenium to locate the HTML table itself. Because Selenium applies the implicitly_wait you specified earlier, it will wait until the table exists before letting the rest of the code proceed. By the time BS4 receives the page source, the table will be there.

driver.implicitly_wait(10)

# Selenium will wait until the element is found
# I used XPath, but you can use any other matching sequence to get the table
driver.find_element_by_xpath("/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')
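
Note that the find_element_by_* helpers were deprecated in Selenium 4 and removed in later 4.x releases; if you are on a current version, the equivalent call uses a By locator (the XPath itself is unchanged):

from selenium.webdriver.common.by import By

# Same wait-for-the-row idea, written against the Selenium 4 API
driver.find_element(By.XPATH, "/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")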

However, this is a bit of overkill. Yes, you can use Selenium to render the HTML, but you could also just use the requests module (which, judging from your code, you have already imported) to fetch the table data directly.

The data is loaded asynchronously from an endpoint (you can find it yourself with Chrome DevTools). Pair that with the json module to turn the response into a nicely formatted dictionary. Not only is this approach faster, it is also far less resource-intensive (Selenium has to open a whole browser window):

from requests import get
from json import loads

# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text

# Turn to dictionary
data_dictionary = loads(data_as_text)
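
As a quick sanity check you can pretty-print whatever the endpoint returns before deciding how to process it; the exact structure of the response is something to verify in DevTools, so this is just a sketch:

from json import dumps

# Indented output makes the nested fields readable
print(dumps(data_dictionary, indent=2, ensure_ascii=False))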

The website takes time to load the data into the table.

Either apply time.sleep:

import time

driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...
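
A minimal sketch of the full flow with a hard sleep, reusing the parsing code from the question (note that time.sleep always pauses the full 10 seconds, even when the table is ready sooner):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe')
driver.get('https://www.erstebank.hr/hr/tecajna-lista')

# Hard wait: gives the asynchronous table data time to arrive
time.sleep(10)

soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find_all('table'))

driver.quit()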

Or apply an explicit wait so that the rows are loaded into the table:

import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))

# driver.find_element_by_id("popin_tc_privacy_button_2").click() # Cookie setting pop-up. Works fine even without dealing with this pop-up. 
soup = BeautifulSoup(driver.page_source, 'html5lib')

table = soup.find_all('table')

print(table)
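
Once the rows are present you could also hand the rendered page straight to pandas, which parses HTML tables into DataFrames; a sketch, assuming pandas is installed:

from io import StringIO
import pandas as pd

# read_html returns one DataFrame per <table> found in the page
tables = pd.read_html(StringIO(driver.page_source))
print(tables[0])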

You can use this as a basis for further work:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

TDCLASS = 'ng-binding'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
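
If you want the visible text rather than the tags, get_text gives you the cell contents; a small sketch that could replace the print loop above (still inside the with block):

    # Collect the stripped text of each matching cell instead of the raw tags
    values = [td.get_text(strip=True) for td in soup.find_all('td', class_=TDCLASS)]
    print(values)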
