
Extract text and web links with the Selenium WebDriver

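I am learning Selenium and want to extract the text and links of the events listed on Sympla. However, after I click the "more events" button I cannot extract the next events: the code always extracts the same initial events from the page.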

Here is the complete code, so the problem is easy to reproduce.

public static void main(String[] args) throws InterruptedException {

        WebDriverManager.firefoxdriver().setup();
        WebDriver driver = new FirefoxDriver();
        driver.manage().window().maximize();
        driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

        // If have captcha, close the page and exit.
        boolean captcha = driver.getPageSource().contains("Não sou um robô");

        if (captcha == true) {
            System.out.println("O Captcha apareceu, acabou a brincadeira!");

            driver.close();
            driver.quit();
        }

        // load more button
        WebElement CarregarMais = driver.findElement(By
                .xpath("//button[@id='more-events']"));

        // Number of events counter
        List<WebElement> eventos = (List<WebElement>) driver.findElements(By
                .cssSelector("div.event-name.event-card"));
        System.out.println("Number of links: " + eventos.size());

        // Number of links counter
        List<WebElement> eventos_link = (List<WebElement>) driver
                .findElements(By.cssSelector("a.sympla-card.w-inline-block"));

        // iterating over the button more events
        for (int j = 0; j < eventos.size(); j++) {

            CarregarMais.click();

            @SuppressWarnings("deprecation")
            WebDriverWait wait = new WebDriverWait(driver, 10);
            WebElement element = wait.until(ExpectedConditions
                    .elementToBeClickable(By
                            .xpath("//button[@id='more-events']")));

            // Iterating over event links
            for (int i = 0; i < eventos_link.size(); i++) {

                System.out.println(i + " " + eventos.get(i).getText() + " - "
                        + eventos_link.get(i).getAttribute("href"));
                Thread.sleep(500);

            }

        }

    }

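This happens because you never read the links again. Each click on the button changes the page, so the lists you fetched before the loop no longer reflect it and you need to read them again.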

You also need to remember the last link you fetched.

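So, after waiting for the button to become clickable again, re-read eventos and eventos_link. To keep your position you could use a variable declared outside the loop, such as lastFetchedLinkIndex (it is named lastEventScraped in the code below).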

This would be my approach (adapting your code):

WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

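// If a captcha is shown, close the browser and stop.
boolean captcha = driver.getPageSource().contains("Não sou um robô");

if (captcha) {
    System.out.println("O Captcha apareceu, acabou a brincadeira!");

    // quit() closes every window and ends the session, so close() is not needed
    driver.quit();
    // stop here: the driver cannot be used after quit()
    return;
}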

// load more button
WebElement CarregarMais = driver.findElement(By
        .xpath("//button[@id='more-events']"));

// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
        .cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());

// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
        .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// Index of the first event that has not been printed yet; kept across clicks
int lastEventScraped = 0;
// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {

    CarregarMais.click();

    @SuppressWarnings("deprecation")
    WebDriverWait wait = new WebDriverWait(driver, 10);
    WebElement element = wait.until(ExpectedConditions
            .elementToBeClickable(By
                    .xpath("//button[@id='more-events']")));

    // Re-read both lists: the ones fetched before the click do not contain the new events
    eventos = driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    eventos_link = driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
    // Print only the events that were not printed on an earlier pass
    for (int i = lastEventScraped; i < eventos_link.size() && i < eventos.size(); i++, lastEventScraped++) {

        System.out.println(i + " " + eventos.get(i).getText() + " - "
                + eventos_link.get(i).getAttribute("href"));
        Thread.sleep(500);
    }

}
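
To see the idea in isolation, here is a minimal sketch that runs without a browser. It is not Selenium code: FakePage, findEvents and clickLoadMore are hypothetical stand-ins for the real page, driver.findElements(...) and the click on the "more events" button. It also assumes the loop should stop once a click loads nothing new, which is my choice for the sketch and not something the code above does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalScrapeSketch {

    /** Hypothetical stand-in for the page: each "click" reveals one more batch of events. */
    static class FakePage {
        private final List<String> allEvents;
        private final int batchSize;
        private int visible;

        FakePage(List<String> allEvents, int batchSize) {
            this.allEvents = allEvents;
            this.batchSize = batchSize;
            this.visible = Math.min(batchSize, allEvents.size());
        }

        /** Plays the role of driver.findElements(...): a fresh snapshot of what is visible now. */
        List<String> findEvents() {
            return new ArrayList<>(allEvents.subList(0, visible));
        }

        /** Plays the role of clicking "more events"; returns false when nothing is left to load. */
        boolean clickLoadMore() {
            if (visible >= allEvents.size()) {
                return false;
            }
            visible = Math.min(visible + batchSize, allEvents.size());
            return true;
        }
    }

    /** Collects every event exactly once by re-reading the list after each click. */
    static List<String> scrapeAll(FakePage page) {
        List<String> scraped = new ArrayList<>();
        int lastEventScraped = 0; // index of the first event not collected yet
        boolean morePages = true;
        while (morePages) {
            // Re-read the list: an older snapshot would not contain the new events
            List<String> eventos = page.findEvents();
            for (int i = lastEventScraped; i < eventos.size(); i++) {
                scraped.add(i + " " + eventos.get(i));
            }
            lastEventScraped = eventos.size();
            morePages = page.clickLoadMore();
        }
        return scraped;
    }

    public static void main(String[] args) {
        FakePage page = new FakePage(Arrays.asList("A", "B", "C", "D", "E"), 2);
        for (String line : scrapeAll(page)) {
            System.out.println(line);
        }
    }
}
```

If the list were read only once before the loop, as in the question, every pass would see the same first batch. Re-reading it and starting from lastEventScraped is what makes each pass pick up only the new entries.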


Note: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please cite this site's URL or the original source.