HtmlUnit returning empty list of DomElements

I am having trouble retrieving the list of DOM elements when using the getElementsByName method of HtmlPage.

Here is the HTML page (I am trying to get CategoriaAgente from the select tag).

HTML (the part that I need):

<select name="CategoriaAgente">
  <option value="-">Escolha uma categoria</option>
  <option value="t">Todos</option>
  <option value="p">Permissionária de Distribuição</option>
  <option value="d">Concessionária de Distribuição</option>
</select>

Snippet of the Java code (using HtmlUnit):

public List<HtmlOption> listaAgentes() {
    List<HtmlOption> listaAgentes = null;

    try (WebClient webClient = new WebClient()) {
        log.info("COLETANDO AGENTES");

        // parâmetros do webclient
        webClient.setJavaScriptTimeout(15000);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(300000);

        String url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
        HtmlPage page = webClient.getPage(url);
        
        // SELECIONAR CATEGORIA AGENTE
        List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");
        // ...

The list listaCategoriaAgente is ALWAYS empty. I tried some solutions found on SO, but none of them worked. Any help? Thanks in advance!

EDIT: After the comment from @hooknc, I found that the page is asking for some kind of captcha from Cloudflare. This is what I get from Postman:

[image: screenshot of the Postman response]

Does anyone know how to bypass this challenge-form using HtmlUnit? Thanks!
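Before trying to get past the challenge, it can help to confirm from code that the response really is the Cloudflare interstitial and not the target page. The following is a minimal sketch that uses only the JDK and only the markers quoted in this thread (challenge-form, challenge-success, and the "Checking if the site connection is secure" text). The class name ChallengeCheck and the method looksLikeChallenge are hypothetical names for illustration, and the marker list is an assumption that Cloudflare may invalidate at any time.

```java
import java.util.List;

/** Heuristic check for a Cloudflare interstitial, based only on markers quoted in this thread. */
class ChallengeCheck {

    // Assumption: these strings identify the challenge page. They come from the
    // HTML shown in this post and are not a stable or official contract.
    private static final List<String> MARKERS = List.of(
            "challenge-form",
            "challenge-success",
            "Checking if the site connection is secure");

    /** Returns true if the markup contains any of the known challenge markers. */
    static boolean looksLikeChallenge(String html) {
        if (html == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (html.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String challenge = "<div id=\"challenge-success\" style=\"display: none;\">Proceeding...</div>";
        String target = "<select name=\"CategoriaAgente\"><option value=\"t\">Todos</option></select>";

        System.out.println(looksLikeChallenge(challenge)); // true
        System.out.println(looksLikeChallenge(target));    // false
    }
}
```

With HtmlUnit, the string passed in would typically be the result of page.asXml(). If the check returns true, an empty result from getElementsByName is expected, because the select element is not in the markup that was actually received.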

EDIT 2:

Well, I think I made some progress(?)...

This is the code so far:

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setCssEnabled(false);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getCache().setMaxSize(0);
        webClient.waitForBackgroundJavaScript(10_000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

        HtmlPage page = null;
        String url = null;

        url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
        page = webClient.getPage(url);

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());

            synchronized(page) {
                page.wait(10_000);
            }
            webClient.waitForBackgroundJavaScript(10_000);
        }

And this is what I get from the log:

<div id="challenge-success" style="display: none;">
      <div class="h2">
        <span class="icon-wrapper">
          <img class="heading-icon" alt="Success icon" src=""/>
        </span>
        Connection is secure
      </div>
      <div class="core-msg spacer">
        Proceeding...
      </div>
    </div>

So it says Proceeding... but nothing happens. I waited forever, but it just stays stuck on Proceeding...

Any thoughts? Thanks!
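One thing worth noting about the code above is that page.wait(10_000) is a fixed sleep: it always blocks for the full ten seconds and never re-checks anything. An alternative is to poll for the condition you actually care about and stop as soon as it holds. Below is a minimal JDK-only sketch of such a helper; PollUntil and waitUntil are hypothetical names, not part of HtmlUnit. With HtmlUnit, the condition could re-read the window's current page via webClient.getCurrentWindow().getEnclosedPage() and test whether getElementsByName("CategoriaAgente") is non-empty, but that wiring is an assumption and is not shown here. Polling only observes the page; it does not by itself make Cloudflare accept the client.

```java
import java.util.function.BooleanSupplier;

/** Polls a condition until it holds or a timeout expires. */
class PollUntil {

    /**
     * Evaluates the condition, then sleeps pollMillis and retries until the
     * condition is true or timeoutMillis has elapsed.
     *
     * @return true if the condition became true in time; false on timeout or interruption
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still observe it.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 300 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2_000, 50);
        System.out.println("ready = " + ready);
    }
}
```

The condition is evaluated before the deadline check, so it is always tested at least once, even with a timeout of zero.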

Well, this is what happened. I posted a related question, and someone (possibly from the HtmlUnit team) pushed an update on Git to solve the cookie problem. When I used that updated version (2.68.0-SNAPSHOT; I had to update the version of apache-commons-lang3 too), all the problems disappeared. Cloudflare accepted the connection and everything worked. Here is the final version of the code:

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        String url = "https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/";
        
        // parâmetros do webclient
        webClient.getOptions().setCssEnabled(true);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        
        CookieManager cookies = new CookieManager();            
        cookies.setCookiesEnabled(true);
        webClient.setCookieManager(cookies);
        
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        
        webClient.waitForBackgroundJavaScript(10000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10000);
        
        webClient.getCache().setMaxSize(0);
        
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
        
        HtmlPage page = webClient.getPage(url);
        webClient.getRefreshHandler().handleRefresh(page, new URL(url), 10);
        
        synchronized(page) {
            page.wait(10000);
        }
        
        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());
            webClient.waitForBackgroundJavaScript(10_000);
        }

        List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");

With the update and this piece of code, the list of DOM elements I needed came back properly. Thank you all for the help!
