
How to extract all the data from a web page with scrolling, across different rank pages, using Selenium with Python?

I am trying to read all the NFTs from https://opensea.io/rankings?category=new, which shows 100 NFTs on each of 5 rank pages, 500 NFTs in total.

My code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://opensea.io/rankings?category=new")
driver.maximize_window()
time.sleep(3)

l = driver.find_element(By.XPATH, "//div[@role='list']")
nfts = l.find_elements(By.XPATH, ".//div[@role='listitem']")
column_name = driver.find_element(By.CLASS_NAME, 'ggkQUt')
column_name = column_name.text.split('\n')
my_data = {}
for i in column_name:
    my_data[i] = []
del my_data['arrow_drop_down']
print(my_data)

for nft in nfts:
    nft = nft.text.split('\n')
    for item, col in zip(nft, my_data.keys()):
        my_data[col].append(item)

Here the nfts list contains just 16 NFTs. I came to know this is because not all NFTs are visible on the page at the same time. I tried resolving it but couldn't find any answer that solves my problem. I am new to Selenium; any help would be appreciated.

Note: this is a Java-based solution.

When you open the given URL, the 100 NFT rows do not all load at once; new NFTs appear only as you scroll down in small steps. Based on this observation, I wrote the code with the following approach:

  • Launch the browser and navigate to the given URL
  • Set the scroll step size and the maximum number of pages from which you want to pick the NFT data
  • [OUTER LOOP] For each page, do the following:
    • Wait for the presence of at least one NFT row (located by the CSS selector div[role='listitem'] div.cIYIHz span div). This makes sure some NFT data has been loaded and is ready to be consumed by our script
    • [INNER LOOP] Continuously scroll down in small steps, doing the following until the page bottom is reached:
      • Find all the elements matching the CSS selector div[role='listitem']
      • For each of these rows, capture the individual columns' data, e.g. collection name (div.cIYIHz>span>div), volume (div.jYqxGr span.heRNcW div), etc. Store each row's data as a Map(K, V), where K = column name and V = that column's value for the current row
      • Note: in this inner loop you may get the same row in different iterations. To avoid duplicates, I am using a HashSet, which does not allow duplicate values, so all the data is stored as Set<Map<K, V>>, where each item in the set corresponds to one row's data
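The dedup idea in the last bullet translates directly to Python: a dict is not hashable, but a tuple of its (column, value) pairs is, so a plain set plays the role of Java's HashSet<Map<K, V>>. A tiny self-contained illustration (the collection names and numbers here are made up):

```python
# Overlapping harvests from two successive scroll iterations;
# the first row appears in both, as rows often do while scrolling.
batch_1 = [{"name": "Azuki", "volume": "12.3"}, {"name": "Doodles", "volume": "8.1"}]
batch_2 = [{"name": "Azuki", "volume": "12.3"}, {"name": "Moonbirds", "volume": "5.7"}]

unique_rows = set()
for batch in (batch_1, batch_2):
    for row in batch:
        # sort the items so the same row always yields the same hashable key
        unique_rows.add(tuple(sorted(row.items())))

print(len(unique_rows))  # 3 distinct rows despite the overlap
```

Since the set compares whole rows, a row is counted once no matter how many scroll iterations it appears in.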

Java code (with demo):

package usecase;

import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import io.github.bonigarcia.wdm.WebDriverManager;

public class NFT {
    static WebDriver driver;
    static JavascriptExecutor jse;

    public static WebElement findElement(By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.elementToBeClickable(by));
    }

    public static List<WebElement> findElements(By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy(by));
    }

    public static WebElement findChildElement(WebElement parent, By by) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        return wait.until(ExpectedConditions.presenceOfNestedElementLocatedBy(parent, by));
    }

    public static void main(String[] args) throws InterruptedException {
        int stepSize = 400;                                             //page scroll size in pixels
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("https://opensea.io/rankings?category=new");
        Set<Map<String, String>> uniqueNFTs = new LinkedHashSet<Map<String, String>>();
        jse = (JavascriptExecutor) driver;
        int totalPagesToCheck = 2, pageCounter = 1;                     //have set the maximum pages to scrape to 2. You can change it as per your needs
        do {
            long prev = -1L, curr = 0L;
            findElement(By.cssSelector("div[role='listitem'] div.cIYIHz span div"));            //wait for at-least one row's data to be present on the screen
            while (prev != curr) {
                List<WebElement> rows = findElements(By.cssSelector("div[role='listitem']"));       //get all rows
                for (WebElement row : rows) {
                    Map<String, String> rowData = new LinkedHashMap<String, String>();
                    rowData.put("name", findChildElement(row, By.cssSelector("div.cIYIHz>span>div")).getText());        //fetching the Collection name for current/each row
                    rowData.put("volume",
                            findChildElement(row, By.cssSelector("div.jYqxGr span.heRNcW div")).getText());             //fetching the Collection volume for current/each row. You can get other columns also similarly
                    uniqueNFTs.add(rowData);
                }
                jse.executeScript("window.scrollBy(0," + stepSize + ")");                           //scroll down in small steps. Remember, we had set stepSize to 400 above. Change it as per your needs.
                prev = curr;
                curr = (Long) (jse.executeScript("return window.pageYOffset"));
            }
            try {
                findElement(By.cssSelector("i[value='arrow_forward_ios']")).click();
                pageCounter++;
            } catch (Exception e) {
                e.printStackTrace();
                break;
            }
        } while (pageCounter <= totalPagesToCheck);

        System.out.println(uniqueNFTs.size());
        uniqueNFTs.forEach(nft -> System.out.println(nft));
        driver.quit();
    }
}
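Since the question is about Python, the same approach (inner scroll loop plus outer page loop) translates roughly as below. This is a sketch, not a drop-in script: the locators `div[role='listitem']` (rows) and `i[value='arrow_forward_ios']` (next-page arrow) are taken from the Java code above and may change whenever OpenSea redeploys; because the generated class names like `cIYIHz` are especially fragile, each row is captured here as the tuple of its text lines instead of per-column selectors. The string form `find_elements("css selector", ...)` is equivalent to using `By.CSS_SELECTOR` in Selenium 4.

```python
import time

ROW_SELECTOR = "div[role='listitem']"
NEXT_ARROW = "i[value='arrow_forward_ios']"   # next-page button, from the Java answer

def scrape_rankings(driver, max_pages=2, step_size=400, pause=0.5):
    """Scroll each rank page to the bottom, harvesting rows, then paginate.

    Rows are stored as tuples of their text lines in a set, so a row seen
    in several scroll iterations is counted once (the Python counterpart
    of the answer's Set<Map<K, V>>).
    """
    unique_rows = set()
    for page in range(max_pages):
        prev, curr = -1, 0
        while prev != curr:                       # stop once scrolling no longer moves the page
            for row in driver.find_elements("css selector", ROW_SELECTOR):
                unique_rows.add(tuple(row.text.split("\n")))
            driver.execute_script(f"window.scrollBy(0, {step_size})")
            time.sleep(pause)                     # crude wait; WebDriverWait is the robust choice
            prev, curr = curr, driver.execute_script("return window.pageYOffset")
        if page == max_pages - 1:
            break
        arrows = driver.find_elements("css selector", NEXT_ARROW)
        if not arrows:                            # no arrow: already on the last page
            break
        arrows[0].click()
        driver.execute_script("window.scrollTo(0, 0)")   # start the next page from the top
        time.sleep(pause)
    return unique_rows

# With a real browser:
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://opensea.io/rankings?category=new")
# print(len(scrape_rankings(driver, max_pages=5)))
```

Because the function only calls `find_elements`, `execute_script`, and `click`, it works with any Selenium WebDriver instance; tune `step_size` and `pause` to how quickly the page renders new rows.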
