How to extract all data from a scrolling web page across different ranking pages using Selenium and Python?

Question

I am trying to read all NFTs from https://opensea.io/rankings?category=new. There are 100 NFTs on each of 5 ranking pages, 500 NFTs in total.

My code

import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # needed for the By.XPATH locators below
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://opensea.io/rankings?category=new")
driver.maximize_window()
time.sleep(3)

# find_element_by_* helpers were removed in Selenium 4; use find_element(By..., ...) instead
l = driver.find_element(By.XPATH, "//div[@role='list']")
nfts = l.find_elements(By.XPATH, "//div[@role='listitem']")
column_name = driver.find_element(By.CLASS_NAME, 'ggkQUt')
column_name = column_name.text.split('\n')
my_data = {}
for i in column_name:
    my_data[i] = []
del(my_data['arrow_drop_down'])
print(my_data)

  
for nft in nfts:
    nft = nft.text.split('\n')
    for item, col in zip(nft, my_data.keys()):
        my_data[col].append(item)

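Here the nfts list contains only 16 NFTs. I know this is because not all NFTs are visible on the page at the same time. I tried to solve it but could not find an answer that fixes my problem. I am new to Selenium, and any help would be appreciated.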

Answer 1

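Note: this is a Java-based solution.

When you open the given URL, the 100 NFT rows are not all loaded at once. New NFT rows appear only as you scroll down in small steps. Based on this observation, I wrote the code using the following approach: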

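Launch the browser and navigate to the given URL
Set scrollStepSize and the maximum number of pages you want to pick NFT data from
[OUTER LOOP] For each page, do the following:
- Write logic to wait for the presence of at least one NFT row's data (located at cssSelector - div[role='listitem'] div.cIYIHz span div ). This ensures that some NFT data has loaded and is ready to be used by our script
- [INNER LOOP] Keep scrolling down in small steps and do the following until the bottom of the page is reached:
  - Find all elements using the locator cssSelector - div[role='listitem']
  - For each of these elements/rows, capture the data of the different columns, e.g. Collection name (cssSelector - div.cIYIHz>span>div ), Volume (cssSelector - div.jYqxGr span.heRNcW div ), etc. Store each row's data as a Map(K,V), where K = column name and V = the value under that column for the current row
  - Note: in this inner loop you may get the same row in different iterations. To avoid duplicates, I used a HashSet, which does not allow duplicate values. All of my data is therefore stored as a Set<Map<K,V>>, where each item in the set corresponds to one row's data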
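
Since the question asks about Python, the inner loop described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a port of the tested Java code: collect_rows_while_scrolling and scrape_current_page are hypothetical names, the CSS selectors are copied from this answer and may no longer match the live site, and the scroll loop takes plain callables so that it can be exercised without a browser.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def collect_rows_while_scrolling(
    read_rows: Callable[[], Iterable[Row]],
    scroll_by: Callable[[int], None],
    get_offset: Callable[[], int],
    step: int = 400,
) -> List[Row]:
    """Scroll in small steps and collect every distinct row seen on the way."""
    # Keyed by the row's items; a plain dict keeps first-seen order.
    seen: Dict[Tuple[Tuple[str, str], ...], Row] = {}
    prev, curr = -1, get_offset()
    while prev != curr:  # stop once scrolling no longer moves the page
        for row in read_rows():
            # A dict is not hashable, so its items serve as the de-duplication key.
            seen.setdefault(tuple(row.items()), row)
        scroll_by(step)
        prev, curr = curr, get_offset()
    return list(seen.values())


def scrape_current_page(driver, step: int = 400) -> List[Row]:
    """Hypothetical wiring of the loop to a Selenium WebDriver instance."""
    from selenium.webdriver.common.by import By  # imported lazily: only needed here

    def read_rows() -> List[Row]:
        rows = []
        # Selectors assumed from the answer above; they may have changed on the site.
        for el in driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']"):
            name = el.find_element(By.CSS_SELECTOR, "div.cIYIHz>span>div").text
            volume = el.find_element(By.CSS_SELECTOR, "div.jYqxGr span.heRNcW div").text
            rows.append({"name": name, "volume": volume})
        return rows

    return collect_rows_while_scrolling(
        read_rows,
        lambda px: driver.execute_script("window.scrollBy(0, arguments[0]);", px),
        lambda: driver.execute_script("return Math.round(window.pageYOffset);"),
        step,
    )
```

With a real browser, scrape_current_page(driver) would be called once per ranking page, after waiting for the first row to appear. Rows that are re-rendered while being read may still raise a stale-element error, which this sketch does not handle.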

Java code (with demo):

package usecase;

import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import io.github.bonigarcia.wdm.WebDriverManager;

public class NFT {
    static WebDriver driver;
    static JavascriptExecutor jse;

    public static WebElement findElement(By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.elementToBeClickable(by));
    }

    public static List<WebElement> findElements(By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy(by));
    }

    public static WebElement findChildElement(WebElement parent, By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.presenceOfNestedElementLocatedBy(parent, by));
    }

    public static void main(String[] args) throws InterruptedException {
        int stepSize = 400;                                             //page scroll size in pixels
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("https://opensea.io/rankings?category=new");
        Set<Map<String, String>> uniqueNFTs = new LinkedHashSet<Map<String, String>>();
        jse = (JavascriptExecutor) driver;
        int totalPagesToCheck = 2, pageCounter = 1;                     //have set the maximum pages to scrape to 2. You can change it as per your needs
        do {
            long prev = -1L, curr = 0L;
            findElement(By.cssSelector("div[role='listitem'] div.cIYIHz span div"));            //wait for at-least one row's data to be present on the screen
            while (prev != curr) {
                List<WebElement> rows = findElements(By.cssSelector("div[role='listitem']"));       //get all rows
                for (WebElement row : rows) {
                    Map<String, String> rowData = new LinkedHashMap<String, String>();
                    rowData.put("name", findChildElement(row, By.cssSelector("div.cIYIHz>span>div")).getText());        //fetching the Collection name for current/each row
                    rowData.put("volume",
                            findChildElement(row, By.cssSelector("div.jYqxGr span.heRNcW div")).getText());             //fetching the Collection volume for current/each row. You can get other columns also similarly
                    uniqueNFTs.add(rowData);
                }
                jse.executeScript("window.scrollBy(0," + stepSize + ")");                           //scroll down in small steps. Remember, we had set stepSize to 400 above. Change it as per your needs.
                prev = curr;
                curr = ((Number) jse.executeScript("return window.pageYOffset")).longValue();   //pageYOffset can come back as Long or Double, so convert via Number
            }
            try {
                findElement(By.cssSelector("i[value='arrow_forward_ios']")).click();
                pageCounter++;
            } catch (Exception e) {
                e.printStackTrace();
                break;
            }
        } while (pageCounter <= totalPagesToCheck);

        System.out.println(uniqueNFTs.size());
        uniqueNFTs.forEach(nft -> System.out.println(nft));
        driver.quit();
    }
}
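
The question's code builds a dictionary with one list per column (my_data), while the code above produces one map per row. If the column-oriented shape is still wanted, the conversion could look like the following Python sketch. rows_to_columns is a hypothetical helper name, the sample values are made up for illustration, and the sketch assumes every row carries the same keys (otherwise the column lists would end up with different lengths).

```python
from typing import Dict, Iterable, List


def rows_to_columns(rows: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn a sequence of row dicts into a dict of column lists."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for key, value in row.items():
            # Create the column list on first sight, then append in row order.
            columns.setdefault(key, []).append(value)
    return columns


if __name__ == "__main__":
    # Illustrative rows only; real rows would come from the scraping loop.
    sample = [
        {"name": "A", "volume": "1"},
        {"name": "B", "volume": "2"},
    ]
    print(rows_to_columns(sample))
    # {'name': ['A', 'B'], 'volume': ['1', '2']}
```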

Solution 1
1 Accepted 2022-01-14 17:35:28
