Extract text and links with Selenium WebDriver

I'm studying Selenium and I want to extract the text and links of the events listed on Sympla. After I click the "more events" button, however, I can't extract the newly loaded events: my code always prints the same initial events from the page.

Here is the complete main method for easy reproduction:

public static void main(String[] args) throws InterruptedException {

        WebDriverManager.firefoxdriver().setup();
        WebDriver driver = new FirefoxDriver();
        driver.manage().window().maximize();
        driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

        // If have captcha, close the page and exit.
        boolean captcha = driver.getPageSource().contains("Não sou um robô");

        if (captcha == true) {
            System.out.println("O Captcha apareceu, acabou a brincadeira!");

            driver.close();
            driver.quit();
        }

        // load more button
        WebElement CarregarMais = driver.findElement(By
                .xpath("//button[@id='more-events']"));

        // Number of events counter
        List<WebElement> eventos = (List<WebElement>) driver.findElements(By
                .cssSelector("div.event-name.event-card"));
        System.out.println("Number of links: " + eventos.size());

        // Number of links counter
        List<WebElement> eventos_link = (List<WebElement>) driver
                .findElements(By.cssSelector("a.sympla-card.w-inline-block"));

        // iterating over the button more events
        for (int j = 0; j < eventos.size(); j++) {

            CarregarMais.click();

            @SuppressWarnings("deprecation")
            WebDriverWait wait = new WebDriverWait(driver, 10);
            WebElement element = wait.until(ExpectedConditions
                    .elementToBeClickable(By
                            .xpath("//button[@id='more-events']")));

            // Iterating over event links
            for (int i = 0; i < eventos_link.size(); i++) {

                System.out.println(i + " " + eventos.get(i).getText() + " - "
                        + eventos_link.get(i).getAttribute("href"));
                Thread.sleep(500);

            }

        }

    }

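That's because you never read the links again. findElements returns the elements that match at the moment you call it, so the events that appear after each click on the button are not in the lists you fetched before the loop. You need to read them again after every click.

Furthermore, you need to remember the index of the last event you printed, so that each pass only prints the new ones.

So after waiting for the button to become clickable again, re-read eventos and eventos_link. To keep your position, use a variable declared outside the loop, something like lastFetchedLinkIndex (it is called lastEventScraped in the code below).

This would be my approach (your code, adjusted):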

WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

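// If a captcha is shown, quit the browser and stop.
boolean captcha = driver.getPageSource().contains("Não sou um robô");

if (captcha) {
    System.out.println("O Captcha apareceu, acabou a brincadeira!");

    // quit() ends the whole session, so a separate close() is not needed
    driver.quit();
    // assumes this code runs inside main, as in the question;
    // without the return, the calls below would hit a closed browser
    return;
}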

// load more button
WebElement CarregarMais = driver.findElement(By
        .xpath("//button[@id='more-events']"));

// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
        .cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());

// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
        .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// Index of the first event that has not been printed yet
int lastEventScraped = 0;

// Create the wait once and reuse it for every click
@SuppressWarnings("deprecation")
WebDriverWait wait = new WebDriverWait(driver, 10);

// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {

    CarregarMais.click();

    // Wait until the button can be clicked again
    wait.until(ExpectedConditions
            .elementToBeClickable(By
                    .xpath("//button[@id='more-events']")));

    // Re-read both lists: the old ones only hold the elements found before the click
    eventos = (List<WebElement>) driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    eventos_link = (List<WebElement>) driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));

    // Iterating over event links, starting at the first event not printed yet.
    // The bound guards against the two lists having different lengths.
    int total = Math.min(eventos.size(), eventos_link.size());
    for (int i = lastEventScraped; i < total; i++, lastEventScraped++) {

        System.out.println(i + " " + eventos.get(i).getText() + " - "
                + eventos_link.get(i).getAttribute("href"));
        Thread.sleep(500);
    }

}
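
If you want to try the "re-read the list and remember where you stopped" idea without opening a browser, here is a minimal sketch in plain Java. Everything in it is hypothetical: FakePage is only a stand-in for the Sympla page, clickLoadMore() plays the role of CarregarMais.click(), and findEvents() plays the role of driver.findElements(...) by returning a copy of what is on the page at that moment. It also assumes that a click which adds nothing new means there are no more events, which may or may not match how the real page behaves.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalScrapeSketch {

    /** Hypothetical stand-in for the event page: each "click" appends one batch of events. */
    static class FakePage {
        private final List<String> events = new ArrayList<>();
        private final int batchSize;
        private int remainingBatches;

        FakePage(int batchSize, int batches) {
            this.batchSize = batchSize;
            this.remainingBatches = batches;
            loadBatch(); // the first batch is visible without clicking
        }

        private void loadBatch() {
            if (remainingBatches == 0) {
                return; // nothing left to load
            }
            int start = events.size();
            for (int k = 0; k < batchSize; k++) {
                events.add("event-" + (start + k));
            }
            remainingBatches--;
        }

        /** Plays the role of CarregarMais.click(). */
        void clickLoadMore() {
            loadBatch();
        }

        /** Plays the role of driver.findElements(...): a snapshot copy, not a live view. */
        List<String> findEvents() {
            return new ArrayList<>(events);
        }
    }

    /** Collects every event exactly once by re-reading the list after each click. */
    static List<String> scrapeAll(FakePage page) {
        List<String> collected = new ArrayList<>();
        int lastEventScraped = 0; // index of the first event not collected yet
        while (true) {
            List<String> eventos = page.findEvents(); // re-read on every pass
            if (eventos.size() == lastEventScraped) {
                break; // the last click produced nothing new (assumed end of data)
            }
            for (int i = lastEventScraped; i < eventos.size(); i++) {
                collected.add(eventos.get(i));
            }
            lastEventScraped = eventos.size();
            page.clickLoadMore();
        }
        return collected;
    }

    public static void main(String[] args) {
        FakePage page = new FakePage(3, 2);
        List<String> stale = page.findEvents(); // fetched before the click
        page.clickLoadMore();
        System.out.println("stale list size: " + stale.size());
        System.out.println("re-read list size: " + page.findEvents().size());
        System.out.println("scraped: " + scrapeAll(new FakePage(3, 2)));
    }
}
```

With two batches of three events, the list fetched before the click still holds three entries while a fresh call returns six, which is the same effect you are seeing with eventos and eventos_link. scrapeAll stops when a click adds nothing new instead of looping on j < eventos.size(), so it does not depend on how many events were visible at the start.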
