簡體   English   中英

使用 selenium WebDriver 提取文本和 web 鏈接

[英]Extract text and web links with the selenium WebDriver

我正在研究 selenium 並且我想從 Sympla 的事件中提取文本和鏈接,但是當我單擊“更多事件”按鈕時,我無法提取下一個事件,它總是從頁面中提取相同的初始事件.

完整的 class 便於復制。

public static void main(String[] args) throws InterruptedException {

        WebDriverManager.firefoxdriver().setup();
        WebDriver driver = new FirefoxDriver();
        driver.manage().window().maximize();
        driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

        // If have captcha, close the page and exit.
        boolean captcha = driver.getPageSource().contains("Não sou um robô");

        if (captcha == true) {
            System.out.println("O Captcha apareceu, acabou a brincadeira!");

            driver.close();
            driver.quit();
        }

        // load more button
        WebElement CarregarMais = driver.findElement(By
                .xpath("//button[@id='more-events']"));

        // Number of events counter
        List<WebElement> eventos = (List<WebElement>) driver.findElements(By
                .cssSelector("div.event-name.event-card"));
        System.out.println("Number of links: " + eventos.size());

        // Number of links counter
        List<WebElement> eventos_link = (List<WebElement>) driver
                .findElements(By.cssSelector("a.sympla-card.w-inline-block"));

        // iterating over the button more events
        for (int j = 0; j < eventos.size(); j++) {

            CarregarMais.click();

            @SuppressWarnings("deprecation")
            WebDriverWait wait = new WebDriverWait(driver, 10);
            WebElement element = wait.until(ExpectedConditions
                    .elementToBeClickable(By
                            .xpath("//button[@id='more-events']")));

            // Iterating over event links
            for (int i = 0; i < eventos_link.size(); i++) {

                System.out.println(i + " " + eventos.get(i).getText() + " - "
                        + eventos_link.get(i).getAttribute("href"));
                Thread.sleep(500);

            }

        }

    }

這是因為您不再閱讀鏈接。 每次單擊按鈕都會創建一個新頁面,因此您需要再次閱讀它們。

此外,您需要存儲最后獲取的鏈接。

因此,在等待按鈕再次可點擊后,您需要重新閱讀eventoseventos_link 也許您使用像lastFetchedLinkIndex這樣的全局變量。

這將是我的方法(調整你的代碼):

WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

// If have captcha, close the page and exit.
boolean captcha = driver.getPageSource().contains("Não sou um robô");

if (captcha == true) {
    System.out.println("O Captcha apareceu, acabou a brincadeira!");

    driver.close();
    driver.quit();
}

// load more button
WebElement CarregarMais = driver.findElement(By
        .xpath("//button[@id='more-events']"));

// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
        .cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());

// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
        .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
int lastEventScraped = 0;
// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {

    CarregarMais.click();

    @SuppressWarnings("deprecation")
    WebDriverWait wait = new WebDriverWait(driver, 10);
    WebElement element = wait.until(ExpectedConditions
            .elementToBeClickable(By
                    .xpath("//button[@id='more-events']")));

    eventos = (List<WebElement>) driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    eventos_link = (List<WebElement>) driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
    // Iterating over event links
    for (int i = lastEventScraped; i < eventos_link.size(); i++, lastEventScraped++) {

        System.out.println(i + " " + eventos.get(i).getText() + " - "
                + eventos_link.get(i).getAttribute("href"));
        Thread.sleep(500);
    }

}

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM