Extract text and web links with the Selenium WebDriver

I'm studying Selenium and I want to extract the text and links of Sympla's events. However, after I click the "more events" button I can't extract the newly loaded events: the code always extracts the same initial events from the page.

The complete main method, for easy reproduction:

public static void main(String[] args) throws InterruptedException {

        WebDriverManager.firefoxdriver().setup();
        WebDriver driver = new FirefoxDriver();
        driver.manage().window().maximize();
        driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

        // If have captcha, close the page and exit.
        boolean captcha = driver.getPageSource().contains("Não sou um robô");

        if (captcha == true) {
            System.out.println("O Captcha apareceu, acabou a brincadeira!");

            driver.close();
            driver.quit();
        }

        // load more button
        WebElement CarregarMais = driver.findElement(By
                .xpath("//button[@id='more-events']"));

        // Number of events counter
        List<WebElement> eventos = (List<WebElement>) driver.findElements(By
                .cssSelector("div.event-name.event-card"));
        System.out.println("Number of links: " + eventos.size());

        // Number of links counter
        List<WebElement> eventos_link = (List<WebElement>) driver
                .findElements(By.cssSelector("a.sympla-card.w-inline-block"));

        // iterating over the button more events
        for (int j = 0; j < eventos.size(); j++) {

            CarregarMais.click();

            @SuppressWarnings("deprecation")
            WebDriverWait wait = new WebDriverWait(driver, 10);
            WebElement element = wait.until(ExpectedConditions
                    .elementToBeClickable(By
                            .xpath("//button[@id='more-events']")));

            // Iterating over event links
            for (int i = 0; i < eventos_link.size(); i++) {

                System.out.println(i + " " + eventos.get(i).getText() + " - "
                        + eventos_link.get(i).getAttribute("href"));
                Thread.sleep(500);

            }

        }

    }

It's because you don't read the links again. With every click on the button the page content changes, so the lists you fetched before the loop are out of date and you need to read them again.

Furthermore, you need to remember the index of the last link you fetched, so that events you have already printed are not printed again.

So after waiting for the button to be clickable again, you need to re-read eventos and eventos_link. You could also keep a variable such as lastFetchedLinkIndex to track where the previous pass stopped.
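The idea of re-reading the list and continuing from a remembered index can be tried without a browser. The following is only a minimal sketch under stated assumptions: a plain `List<String>` stands in for the page, appending to it stands in for clicking the button, and the class name `IncrementalScrape` and the helper `newEntries` are hypothetical names introduced for illustration, not part of Selenium.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalScrape {

    // Returns only the entries that appeared since the previous read.
    // lastFetchedLinkIndex is the number of entries already processed.
    public static List<String> newEntries(List<String> freshSnapshot, int lastFetchedLinkIndex) {
        if (lastFetchedLinkIndex >= freshSnapshot.size()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(
                freshSnapshot.subList(lastFetchedLinkIndex, freshSnapshot.size()));
    }

    public static void main(String[] args) {
        // Stand-in for the page (assumption): each "click" appends more events.
        List<String> page = new ArrayList<>(Arrays.asList("event-1", "event-2"));
        int lastFetchedLinkIndex = 0;

        for (int click = 0; click < 3; click++) {
            // Re-read the list on every pass (stand-in for driver.findElements).
            List<String> snapshot = new ArrayList<>(page);

            // Print only what has not been printed before.
            for (String event : newEntries(snapshot, lastFetchedLinkIndex)) {
                System.out.println(event);
            }
            lastFetchedLinkIndex = snapshot.size();

            // Simulate the "more events" button loading two more cards.
            int n = page.size();
            page.add("event-" + (n + 1));
            page.add("event-" + (n + 2));
        }
    }
}
```

Each event is printed once because every pass takes a fresh snapshot and starts at the stored index. If the snapshot were taken once before the loop, as in the question, the inner loop would keep seeing the same initial entries.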

This would be my approach (your code, adjusted):

WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");

driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

// If a captcha is shown, close the browser and stop.
boolean captcha = driver.getPageSource().contains("Não sou um robô");

if (captcha) {
    System.out.println("O Captcha apareceu, acabou a brincadeira!");

    // quit() closes every window and ends the session
    driver.quit();
    // stop here: the driver cannot be used after quit()
    return;
}

// load more button
WebElement CarregarMais = driver.findElement(By
        .xpath("//button[@id='more-events']"));

// Number of events counter
List<WebElement> eventos = (List<WebElement>) driver.findElements(By
        .cssSelector("div.event-name.event-card"));
System.out.println("Number of links: " + eventos.size());

// Number of links counter
List<WebElement> eventos_link = (List<WebElement>) driver
        .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// Index of the next event that has not been printed yet
int lastEventScraped = 0;
// iterating over the button more events
for (int j = 0; j < eventos.size(); j++) {

    CarregarMais.click();

    @SuppressWarnings("deprecation")
    WebDriverWait wait = new WebDriverWait(driver, 10);
    // Keep the element returned by the wait so the next click uses a fresh reference
    CarregarMais = wait.until(ExpectedConditions
            .elementToBeClickable(By
                    .xpath("//button[@id='more-events']")));

    // Re-read both lists: the ones fetched before the click are out of date
    eventos = (List<WebElement>) driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    eventos_link = (List<WebElement>) driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
    // Iterating over event links, starting after the last one already printed;
    // the bound guards against the two lists having different sizes
    int total = Math.min(eventos.size(), eventos_link.size());
    for (int i = lastEventScraped; i < total; i++, lastEventScraped++) {

        System.out.println(i + " " + eventos.get(i).getText() + " - "
                + eventos_link.get(i).getAttribute("href"));
        Thread.sleep(500);
    }

}
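An alternative to index tracking, not part of the approach above, is to key every event by its href and skip links that were already collected. The sketch below is only an illustration under stated assumptions: the class `EventCollector`, its method `addBatch`, and the example.com URLs are hypothetical, and each `String[]` pair stands in for the text and href that would be read from the page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EventCollector {

    // href -> event name, kept in first-seen order
    private final Map<String, String> seen = new LinkedHashMap<>();

    // Adds one re-read batch of {name, href} pairs and
    // returns how many of them had not been collected before.
    public int addBatch(List<String[]> batch) {
        int added = 0;
        for (String[] nameAndHref : batch) {
            // putIfAbsent returns null only when the href was not stored yet
            if (seen.putIfAbsent(nameAndHref[1], nameAndHref[0]) == null) {
                added++;
            }
        }
        return added;
    }

    public Map<String, String> events() {
        return seen;
    }

    public static void main(String[] args) {
        EventCollector collector = new EventCollector();

        // First read of the page (hypothetical data).
        List<String[]> firstRead = Arrays.asList(
                new String[] {"Event A", "https://example.com/a"},
                new String[] {"Event B", "https://example.com/b"});

        // Second read after a click: the old cards plus one new card.
        List<String[]> secondRead = Arrays.asList(
                new String[] {"Event A", "https://example.com/a"},
                new String[] {"Event B", "https://example.com/b"},
                new String[] {"Event C", "https://example.com/c"});

        System.out.println("new in first read: " + collector.addBatch(firstRead));
        System.out.println("new in second read: " + collector.addBatch(secondRead));

        for (Map.Entry<String, String> e : collector.events().entrySet()) {
            System.out.println(e.getValue() + " - " + e.getKey());
        }
    }
}
```

With this design, a batch that adds nothing new could serve as a signal to stop clicking, assuming the site stops adding cards once everything is loaded. It also does not depend on the event list and the link list having the same length.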
