
Webscraping with Selenium in Python

I am trying to scrape the list of DAOs from messari.io, but I am running into the following errors:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object


driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "


88b           d88                                                            88
888b         d888                                                            ""
88'8b       d8'88
88 '8b     d8' 88   ,adPPYba,  ,adPPYba,  ,adPPYba,  ,adPPYYba,  8b,dPPYba,  88
88  '8b   d8'  88  a8P_____88  I8[    ""  I8[    ""  ""     'Y8  88P'   "Y8  88
88   '8b d8'   88  8PP"""""""   '"Y8ba,    '"Y8ba,   ,adPPPPP88  88          88
88    '888'    88  "8b,   ,aa  aa    ]8I  aa    ]8I  88,    ,88  88          88
88     '8'     88   '"Ybbd8"'  '"YbbdP"'  '"YbbdP"'  '"8bbdP"Y8  88          88


", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
  File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
    matches = WebDriverWait(driver, 10).until(
  File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
        Ordinal0 [0x0096B8F3+2406643]
        Ordinal0 [0x008FAF31+1945393]
        Ordinal0 [0x007EC748+837448]
        Ordinal0 [0x008192E0+1020640]
        Ordinal0 [0x0081957B+1021307]
        Ordinal0 [0x00846372+1205106]
        Ordinal0 [0x008342C4+1131204]
        Ordinal0 [0x00844682+1197698]
        Ordinal0 [0x00834096+1130646]
        Ordinal0 [0x0080E636+976438]
        Ordinal0 [0x0080F546+980294]
        GetHandleVerifier [0x00BD9612+2498066]
        GetHandleVerifier [0x00BCC920+2445600]
        GetHandleVerifier [0x00A04F2A+579370]
        GetHandleVerifier [0x00A03D36+574774]
        Ordinal0 [0x00901C0B+1973259]
        Ordinal0 [0x00906688+1992328]
        Ordinal0 [0x00906775+1992565]
        Ordinal0 [0x0090F8D1+2029777]
        BaseThreadInitThunk [0x777BFA29+25]
        RtlGetAppContainerNamedObjectPath [0x77B77A7E+286]
        RtlGetAppContainerNamedObjectPath [0x77B77A4E+238]

I know messari.io has an API, but I am almost certain it only covers their assets and not their list of DAOs. I tried using Selenium since it is a dynamic page, but I am still having trouble. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests

url = 'https://messari.io/governor/daos'

DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")

# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)

finally:
    driver.quit()

Update: I fixed the executable_path warning, but I am still getting the same TimeoutException. When I run it without headless mode, I also get the following message:

DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

I assume this part is a hardware message that I shouldn't worry about, based on similar questions, because when I unplugged my mouse one of the two messages went away.

This page doesn't use <td> to display the list of DAOs, so the wait for a <td> never succeeds and ends in the TimeoutException.
It uses <div> elements (styled with CSS) to make the list look like a table.

It keeps the name of each DAO in an <h4>.

At least that is what it uses in Firefox on my Linux laptop.


Full working code (tested on Linux Mint, Python 3.8, Selenium 4.x, Chrome 101.x):

I used the webdriver_manager module, so it automatically downloads a fresh driver when Linux installs a newer version of Chrome.

To get all the <h4> elements I have to use find_elements() (with an s in elements) or presence_of_all_elements_located().

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

url = 'https://messari.io/governor/daos'

options = Options()
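# options.headless was removed in later Selenium 4 releases; the flag works everywhere
options.add_argument("--headless")
# no space after the comma, otherwise Chrome may not parse the size
options.add_argument("--window-size=1920,1200")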

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))

driver.get(url)

try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))
    
    #matches = driver.find_elements(By.TAG_NAME, "h4")
    
    for match in matches:
        if match.text:
            print(match.text)
finally:
    driver.quit()

Result:

Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist
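
The code above prints only the <h4> elements that have text, because some of them are empty when the wait returns. As a possible refinement, WebDriverWait.until() accepts any callable that takes the driver, so the wait can continue until at least one <h4> has text. The following is a minimal sketch; the function name non_empty_h4_texts is made up for illustration, and it assumes the DAO names are still rendered in <h4> elements.

```python
def non_empty_h4_texts(driver):
    """Wait condition: return the non-empty <h4> texts, or False if there are none yet.

    WebDriverWait.until() keeps calling this until it returns something truthy.
    """
    # "tag name" is the string value behind selenium's By.TAG_NAME
    elements = driver.find_elements("tag name", "h4")
    texts = [element.text.strip() for element in elements]
    texts = [text for text in texts if text]
    return texts or False


# Hypothetical usage with the driver created above:
#     names = WebDriverWait(driver, 10).until(non_empty_h4_texts)
#     for name in names:
#         print(name)
```

If the page re-renders while the texts are being read, Selenium can still raise StaleElementReferenceException; this sketch does not handle that case.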

EDIT:

To get all values you may have to scroll the page, and JavaScript will then add new items.

There are answers that use a while loop with execute_script(), which runs JavaScript code to scroll to the bottom and read the current height. If the height differs from the value before the scroll, you have to scroll again; if it stays the same, you have reached the end of the page and can get all items.
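
The scroll loop described above could look roughly like this. It is a sketch and was not tested against messari.io: the function name scroll_to_bottom, the pause length, and the max_rounds safety limit are assumptions, and it assumes the page grows document.body rather than an inner scrollable container.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. max_rounds is a safety limit
    so that an endlessly growing page cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom, then give JavaScript time to load more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # height unchanged: nothing new was added, end of page
        last_height = new_height
    return rounds


# Hypothetical usage: call it after driver.get(url) and before collecting the <h4> elements.
#     scroll_to_bottom(driver)
#     matches = driver.find_elements(By.TAG_NAME, "h4")
```

A fixed pause is the simplest choice; on a slow connection it may need to be longer, otherwise the loop can stop before the new items have arrived.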

In Selenium 4 the executable_path argument is deprecated; you have to pass an instance of the Service() class instead, here combined with ChromeDriverManager().install(), as shown below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")
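
If you would rather keep a chromedriver binary that is already on disk, as in the question, instead of installing webdriver_manager, the same Service class can take the path directly. This is a sketch under that assumption; the helper name make_chrome_driver is made up for illustration, and the path is the question's own placeholder.

```python
def make_chrome_driver(driver_path, headless=True):
    """Create a Chrome driver from a local chromedriver binary via Service."""
    # Imported inside the function so this module still loads where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        options.add_argument("--headless")
    options.add_argument("--window-size=1920,1200")

    # Service(executable_path=...) replaces the deprecated
    # webdriver.Chrome(executable_path=...) argument from the question.
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Hypothetical usage with the placeholder path from the question:
#     driver = make_chrome_driver('PATH_TO_DRIVER_ON_MY_PC')
#     driver.get('https://messari.io/governor/daos')
```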
