繁体   English   中英

在 Python 中使用 Selenium 进行网页抓取

[英]Webscraping with Selenium in Python

我正在尝试从 masari.io 抓取 DAO 列表,但我遇到了问题,因为我收到以下错误:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object


driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "


88b           d88                                                            88
888b         d888                                                            ""
88'8b       d8'88
88 '8b     d8' 88   ,adPPYba,  ,adPPYba,  ,adPPYba,  ,adPPYYba,  8b,dPPYba,  88
88  '8b   d8'  88  a8P_____88  I8[    ""  I8[    ""  ""     'Y8  88P'   "Y8  88
88   '8b d8'   88  8PP"""""""   '"Y8ba,    '"Y8ba,   ,adPPPPP88  88          88
88    '888'    88  "8b,   ,aa  aa    ]8I  aa    ]8I  88,    ,88  88          88
88     '8'     88   '"Ybbd8"'  '"YbbdP"'  '"YbbdP"'  '"8bbdP"Y8  88          88


", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
  File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
    matches = WebDriverWait(driver, 10).until(
  File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
        Ordinal0 [0x0096B8F3+2406643]
        Ordinal0 [0x008FAF31+1945393]
        Ordinal0 [0x007EC748+837448]
        Ordinal0 [0x008192E0+1020640]
        Ordinal0 [0x0081957B+1021307]
        Ordinal0 [0x00846372+1205106]
        Ordinal0 [0x008342C4+1131204]
        Ordinal0 [0x00844682+1197698]
        Ordinal0 [0x00834096+1130646]
        Ordinal0 [0x0080E636+976438]
        Ordinal0 [0x0080F546+980294]
        GetHandleVerifier [0x00BD9612+2498066]
        GetHandleVerifier [0x00BCC920+2445600]
        GetHandleVerifier [0x00A04F2A+579370]
        GetHandleVerifier [0x00A03D36+574774]
        Ordinal0 [0x00901C0B+1973259]
        Ordinal0 [0x00906688+1992328]
        Ordinal0 [0x00906775+1992565]
        Ordinal0 [0x0090F8D1+2029777]
        BaseThreadInitThunk [0x777BFA29+25]
        RtlGetAppContainerNamedObjectPath [0x77B77A7E+286]
        RtlGetAppContainerNamedObjectPath [0x77B77A4E+238]

我知道 messari.io 有一个 API,但我几乎可以肯定它只适用于他们的资产,而不是他们的 DAO 列表。 我尝试使用 Selenium,因为它是一个动态页面,但我仍然遇到问题。 这是我的代码:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

url = 'https://messari.io/governor/daos'

DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)

finally:
    driver.quit()

更新我修复了 executable_path 警告,但我仍然收到相同的 TimeoutException 错误。 当我在没有无头的情况下运行它时,我还会收到以下消息:

DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

我认为这部分更像是一条硬件消息,基于类似的问题,我不应该担心,因为当我拔下鼠标时,它删除了其中一个。

此页面不使用<td>来显示 DAO 列表。
它使用<div> (使用CSS )来显示它类似于表格。

它在<h4>中保留了 DAO 的名称

至少它在我的带有 Linux 的笔记本电脑上的 Firefox 中使用。


完整的工作代码(在 Linux Mint、Python 3.8、Selenium 4.x、Chrome 101.x 上测试)

我使用了模块webdriver_manager所以当 Linux 安装较新版本的 Chrome 时它会自动下载新的驱动程序

我必须使用find_elements() (在 word elements中带有s )或presence_of_all_elements_located()来获取所有<h4>

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

url = 'https://messari.io/governor/daos'

options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))
    
    #matches = driver.find_elements(By.TAG_NAME, "h4")
    
    for match in matches:
        if match.text:
            print(match.text)
finally:
    driver.quit()

结果:

Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist

编辑:

要获取所有值,您可能需要滚动页面 - JavaScript 将添加新项目。

有些答案使用while -loop 和execute_script() ,它使用 JavaScript 代码滚动到底部并获取当前高度。 如果高度与滚动前不同,则您必须再次滚动,但如果高度相同,则您有页尾,现在您可以获取所有项目。

使用selenium4作为 key executable_path已被弃用,您必须使用Service()类的实例以及ChromeDriverManager().install()命令,如下所述

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM