Extract text and web links with the Selenium WebDriver
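I'm learning Selenium and I want to extract the text and links of the events listed on Sympla. When I click the "more events" button, though, I can't extract the newly loaded events: the code always extracts the same initial events from the page.
Here is the complete code, so it is easy to reproduce: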
public static void main(String[] args) throws InterruptedException {
    WebDriverManager.firefoxdriver().setup();
    WebDriver driver = new FirefoxDriver();
    driver.manage().window().maximize();
    driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");
    driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    // If a captcha appears, close the page and exit.
    boolean captcha = driver.getPageSource().contains("Não sou um robô");
    if (captcha == true) {
        System.out.println("O Captcha apareceu, acabou a brincadeira!");
        driver.close();
        driver.quit();
    }
    // "Load more" button
    WebElement CarregarMais = driver.findElement(By
            .xpath("//button[@id='more-events']"));
    // Count the events
    List<WebElement> eventos = (List<WebElement>) driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    System.out.println("Number of links: " + eventos.size());
    // Collect the links
    List<WebElement> eventos_link = (List<WebElement>) driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
    // Iterate over clicks on the "more events" button
    for (int j = 0; j < eventos.size(); j++) {
        CarregarMais.click();
        @SuppressWarnings("deprecation")
        WebDriverWait wait = new WebDriverWait(driver, 10);
        WebElement element = wait.until(ExpectedConditions
                .elementToBeClickable(By
                        .xpath("//button[@id='more-events']")));
        // Iterate over the event links
        for (int i = 0; i < eventos_link.size(); i++) {
            System.out.println(i + " " + eventos.get(i).getText() + " - "
                    + eventos_link.get(i).getAttribute("href"));
            Thread.sleep(500);
        }
    }
}
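This happens because you never read the links again. Each click on the button changes the page, so the lists you fetched before the loop no longer match it and you need to read them again.
You also need to remember the last link you already fetched.
So, after waiting for the button to become clickable again, re-read eventos and eventos_link. You could keep track of your position with a global variable such as lastFetchedLinkIndex.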
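The bookkeeping described here can be tried without a browser. The following is only a minimal sketch in plain Java, not Selenium code: the class IncrementalScrape, the method newItems, and the sample event names are made up for illustration, and each list stands in for what a fresh findElements call might return after a click.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalScrape {
    // Index of the first item that has not been handed out yet
    // (plays the role of lastFetchedLinkIndex).
    private int lastFetchedIndex = 0;

    // Given a freshly re-read list, return only the items that were not
    // returned by an earlier call, then advance the index past them.
    public List<String> newItems(List<String> freshlyRead) {
        List<String> result = new ArrayList<>();
        for (int i = lastFetchedIndex; i < freshlyRead.size(); i++) {
            result.add(freshlyRead.get(i));
        }
        lastFetchedIndex = Math.max(lastFetchedIndex, freshlyRead.size());
        return result;
    }

    public static void main(String[] args) {
        IncrementalScrape scrape = new IncrementalScrape();
        // Hypothetical page content before and after one click on "more events".
        List<String> firstRead = Arrays.asList("Event A", "Event B");
        List<String> secondRead = Arrays.asList("Event A", "Event B", "Event C");
        System.out.println(scrape.newItems(firstRead));  // prints [Event A, Event B]
        System.out.println(scrape.newItems(secondRead)); // prints [Event C]
    }
}
```

The sketch assumes that new events are appended at the end of the list and that earlier events keep their positions. If the page reorders its cards, an index alone would not be enough.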
This would be my approach (adapting your code):
WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();
driver.manage().window().maximize();
driver.get("https://www.sympla.com.br/eventos?ts=online_mais-de-3-mil-eventos-online");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// If a captcha appears, quit the browser and stop.
boolean captcha = driver.getPageSource().contains("Não sou um robô");
if (captcha) {
    System.out.println("O Captcha apareceu, acabou a brincadeira!");
    driver.quit();
    return; // the driver is gone, so nothing below may run
}
// "Load more" button
WebElement CarregarMais = driver.findElement(By
        .xpath("//button[@id='more-events']"));
// Event cards currently on the page
List<WebElement> eventos = driver.findElements(By
        .cssSelector("div.event-name.event-card"));
System.out.println("Number of events: " + eventos.size());
// Event links currently on the page
List<WebElement> eventos_link = driver
        .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
// Index of the first event that has not been printed yet
int lastEventScraped = 0;
// Click "load more" repeatedly; eventos is re-read below, so this bound grows
for (int j = 0; j < eventos.size(); j++) {
    CarregarMais.click();
    // Selenium 4 replaces this constructor with new WebDriverWait(driver, Duration.ofSeconds(10))
    @SuppressWarnings("deprecation")
    WebDriverWait wait = new WebDriverWait(driver, 10);
    // Keep the reference returned by the wait for the next click
    CarregarMais = wait.until(ExpectedConditions
            .elementToBeClickable(By
                    .xpath("//button[@id='more-events']")));
    // Re-read both lists: the ones fetched earlier do not contain the new events
    eventos = driver.findElements(By
            .cssSelector("div.event-name.event-card"));
    eventos_link = driver
            .findElements(By.cssSelector("a.sympla-card.w-inline-block"));
    // Guard against the two lists having different lengths
    int available = Math.min(eventos.size(), eventos_link.size());
    // Print only the events that were not printed in a previous pass
    for (int i = lastEventScraped; i < available; i++, lastEventScraped++) {
        System.out.println(i + " " + eventos.get(i).getText() + " - "
                + eventos_link.get(i).getAttribute("href"));
        Thread.sleep(500);
    }
}
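One thing the loop above leaves open is when to stop, since its bound eventos.size() grows every time the list is re-read. A possible alternative, sketched here in plain Java rather than Selenium, is to stop as soon as a re-read brings nothing new. Everything in it is an assumption made for illustration: the names LoadMoreLoop and scrapeAll, the Supplier that stands in for "click the button, wait, call findElements again", and the maxClicks cap, which is an added safety limit and not part of the original approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class LoadMoreLoop {
    // clickAndReRead is a stand-in for one click followed by a fresh read of the list.
    public static List<String> scrapeAll(Supplier<List<String>> clickAndReRead, int maxClicks) {
        List<String> scraped = new ArrayList<>();
        int lastScraped = 0; // index of the first item not collected yet
        for (int click = 0; click < maxClicks; click++) {
            List<String> current = clickAndReRead.get();
            if (current.size() <= lastScraped) {
                break; // the click added nothing new, so stop
            }
            for (int i = lastScraped; i < current.size(); i++) {
                scraped.add(current.get(i));
            }
            lastScraped = current.size();
        }
        return scraped;
    }

    public static void main(String[] args) {
        // Hypothetical page states after the first, second and third click.
        Iterator<List<String>> pages = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "B", "C"),
                Arrays.asList("A", "B", "C")).iterator();
        System.out.println(scrapeAll(pages::next, 10)); // prints [A, B, C]
    }
}
```

In a real run the button may also disappear once everything is loaded, in which case the wait would time out instead of returning an unchanged list, so that case would need handling as well.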