
Scraping aspx page using htmlUnit

I'm trying to write a program that opens the page http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br and clicks the 'todas' button on it.

The expected result is a table with the names of many companies, but I don't get it and I don't know why.

My code:

package xx;

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class teste {

    public static void main(String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
       HtmlPage page = null;
       String url = "http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br";

       WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);

       webClient.getOptions().setThrowExceptionOnScriptError(false);
       webClient.getOptions().setCssEnabled(false);
       webClient.getOptions().setJavaScriptEnabled(false);
       webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
       webClient.getOptions().setTimeout(30000);

       page = webClient.getPage( url );

       System.out.println("Current page: Empresas Listadas | BM&FBOVESPA");

       HtmlElement theElement1 = (HtmlElement) page.getElementById("ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnTodas");
       page = theElement1.click();

       System.out.println(page.asText());

       System.out.println("Test has completed successfully");
    }

}

After glancing at that page, I noticed it uses AJAX to load the data. Your code doesn't wait for the data to arrive, and that might be the issue.
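As a rough illustration of what that waiting could look like, here is a minimal sketch. It assumes an HtmlUnit 2.x release that still provides BrowserVersion.FIREFOX_17, as the code in the question does. Note that the question's code calls setJavaScriptEnabled(false), so no AJAX request could run at all; the sketch enables JavaScript, installs NicelyResynchronizingAjaxController, and calls waitForBackgroundJavaScript after the click. The class name TesteAjax, the helper newAjaxClient, and the 10-second wait are my own assumptions, and I have not verified this against the live page.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TesteAjax {

    // Hypothetical helper: builds a client that actually executes the page's scripts.
    static WebClient newAjaxClient() {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setCssEnabled(false);
        // JavaScript must be on, otherwise the AJAX call never runs.
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setTimeout(30000);
        // Lets short-running AJAX calls complete synchronously where possible.
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        return webClient;
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br";
        WebClient webClient = newAjaxClient();
        try {
            HtmlPage page = webClient.getPage(url);

            HtmlElement button = (HtmlElement) page.getElementById(
                    "ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnTodas");
            if (button == null) {
                System.out.println("Button not found; the element id may have changed.");
                return;
            }
            page = button.click();

            // Give background JavaScript up to 10 seconds to finish (assumed value).
            webClient.waitForBackgroundJavaScript(10000);

            System.out.println(page.asText());
        } finally {
            webClient.closeAllWindows();
        }
    }
}
```

If the table still does not show up, the wait may simply be too short, or the site may load the data in a way this sketch does not cover.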

You should first take a look at the HtmlUnit FAQ.

After that, this question might help as a concrete example of how to do it.
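Another common pattern for this kind of problem is to poll for a condition instead of waiting a fixed time. The sketch below is plain Java (Java 8 or later) and does not depend on HtmlUnit; the class Waits and the method until are hypothetical names introduced only for illustration, and the condition in main is a stand-in for "the table has appeared".

```java
import java.util.function.BooleanSupplier;

public class Waits {

    /**
     * Polls the condition until it returns true or timeoutMillis has elapsed.
     * Returns true if the condition was met, false if the timeout was reached.
     */
    static boolean until(BooleanSupplier condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        boolean loaded = until(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("loaded = " + loaded);
    }
}
```

With HtmlUnit, the condition could be something like checking whether page.asText() contains a company name you expect in the table, where page is the HtmlPage returned by the click. Which text or element to check for depends on the page, so that part is left to you.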
