
HTMLUnit in Java - How to navigate to GridView pages

I'm trying to create a Java application that reads information from a web page. I use jsoup (excellent tool!) to extract the elements I want, but I also need to load the next page of the GridView used on that page. The page is an .aspx page, and the link to the second page looks like this:

 <a href="javascript:__doPostBack('GridView1','Page$2')" style="color:White;">2</a>

This is the JavaScript function that the link calls:

    //<![CDATA[
    var theForm = document.forms['form1'];
    if (!theForm) {
        theForm = document.form1;
    }
    function __doPostBack(eventTarget, eventArgument) {
        if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
            theForm.__EVENTTARGET.value = eventTarget;
            theForm.__EVENTARGUMENT.value = eventArgument;
            theForm.submit();
        }
    }
    //]]>
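To make the effect of `__doPostBack` concrete: it fills in two hidden form fields and submits the form, so the browser sends them as ordinary form data. The sketch below only shows how those two pairs would be form-encoded. `PostBackBody` is a hypothetical helper, not part of ASP.NET or HtmlUnit, and a real submit also carries every other field of the form, which is left out here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class PostBackBody {

    // Form-encodes the two fields that __doPostBack sets before calling submit().
    static String build(String eventTarget, String eventArgument) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", eventTarget);
        fields.put("__EVENTARGUMENT", eventArgument);

        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey()))
                .append('=')
                .append(encode(field.getValue()));
        }
        return body.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("GridView1", "Page$2"));
    }
}
```

This is why the URL stays the same when paging: the page number travels in the submitted form data, not in the address.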

Currently, I am trying to do this with HtmlUnit, but it does not seem to work. Below is the code I am using:

    final WebClient webClient = new WebClient(BrowserVersion.CHROME);
    HtmlPage page = webClient.getPage("http://www.webpage.com/Main.aspx");
    HtmlAnchor anchor = null;
    List<HtmlAnchor> anchors = page.getAnchors();
    for (int j = 0; j < anchors.size(); j++)
    {
        anchor = anchors.get(j);
        String sAnchor = anchor.asText();
        String sAnchorxml = anchor.asXml();
        if (sAnchor.equals("2"))
        {
            HtmlPage page2 = anchor.click();
            doc = Jsoup.parse(page2.asXml());
            .....

When I read the resulting page with the same code I use for the first page, I get the following error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at test.advacus.com.MainProgram.main(MainProgram.java:148)

I assume that my error is in the 'Jsoup.parse()' line. To clarify: when you click through to the next page, the URL does not change, only the contents of the GridView, so I cannot simply parse a new URL.
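The trace itself shows `ArrayList.get` being called directly from `MainProgram.java:148` with `Index: 0, Size: 0`, which suggests that a lookup returned an empty list and `get(0)` was then called on it, rather than a failure inside `Jsoup.parse()`. A minimal sketch of guarding such a lookup follows. `FirstOrEmpty` is a hypothetical helper, and it works on any `List`, including the result of a selector query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class FirstOrEmpty {

    // Returns the first element, or an empty Optional instead of throwing
    // IndexOutOfBoundsException when the list has no elements.
    static <T> Optional<T> first(List<T> items) {
        return items.isEmpty() ? Optional.empty() : Optional.ofNullable(items.get(0));
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList("Toyota", "Porsche");
        List<String> nothing = Collections.emptyList();

        System.out.println(first(found).orElse("(no match)"));
        System.out.println(first(nothing).orElse("(no match)"));
    }
}
```

An empty result on the second page would then show up as a missing value that can be logged, which makes it easier to see that the page content, not the parser, is the problem.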

Any additional help, or a suggestion for a tool that works better with jsoup than HtmlUnit, would be appreciated. Thank you in advance!

Edited for additional info: it looks like click() is what is not working. I modified the code, and the body of newPage appears to contain the same info as the first page:

final WebClient webClient = new WebClient(BrowserVersion.CHROME);
HtmlPage page = webClient.getPage("http://www.qatarsale.com/EnMain.aspx");
HtmlAnchor anchor = page.getAnchorByText("2");
HtmlPage newPage = anchor.click();
HtmlElement el = newPage.getBody();
System.out.println(el.asText());

Inspecting the anchors shows that, as you already pointed out, __doPostBack is called. It is therefore much simpler to invoke that JavaScript call directly than to grab the anchor first and call click() on it.
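If the call should not be hard-coded, it can be read from the anchor's `href` instead. The sketch below assumes the `href` has exactly the shape shown in the question (`javascript:__doPostBack('target','argument')`). `PostBackHref` and its methods are hypothetical names, and the pattern does not handle escaped quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class PostBackHref {

    // Group 1: the whole call, group 2: event target, group 3: event argument.
    private static final Pattern CALL = Pattern.compile(
            "javascript:(__doPostBack\\('([^']*)',\\s*'([^']*)'\\))");

    // Returns the script without the "javascript:" prefix, or null if the
    // href is not a __doPostBack link.
    static String script(String href) {
        Matcher m = CALL.matcher(href.trim());
        return m.matches() ? m.group(1) : null;
    }

    // Returns the event argument (for example "Page$2"), or null.
    static String eventArgument(String href) {
        Matcher m = CALL.matcher(href.trim());
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String href = "javascript:__doPostBack('GridView1','Page$2')";
        System.out.println(script(href));
        System.out.println(eventArgument(href));
    }
}
```

The string returned by `script` is the same kind of expression that is passed to `executeJavaScript` in the example code.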

Example code

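import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Silence HtmlUnit's verbose logging.
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
final WebClient webClient = new WebClient(BrowserVersion.CHROME);

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(10000);

try {
    // Load the first page and print the labels of interest.
    HtmlPage htmlPage = webClient.getPage("http://qatarsale.com/EnMain.aspx");

    Document doc = Jsoup.parse(htmlPage.asXml());

    System.out.println(doc.select("[id$=Label10]").text());

    // Trigger the postback for page 2 directly instead of clicking the anchor.
    ScriptResult result = htmlPage.executeJavaScript("__doPostBack('GridView1','Page$2')");
    htmlPage = (HtmlPage) result.getNewPage();

    Thread.sleep(3000); // delay needed for lazy loading, there might be something cleaner

    // Parse the page again after the postback and print the same labels.
    doc = Jsoup.parse(htmlPage.asXml());

    System.out.println(doc.select("[id$=Label10]").text());

} catch (Exception e) {
    e.printStackTrace();
} finally {
    webClient.close();
}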

Output

Toyota Porsche Mercedes-Benz Cadillac Jeep Porsche Porsche Nissan Mitsubishi BMW Porsche Ford Mitsubishi Toyota Nissan Land Rover Nissan Mercedes-Benz Nissan Nissan Toyota Toyota Porsche Mitsubishi Mitsubishi Nissan Nissan Mercedes-Benz Nissan Jeep Mercedes-Benz Lexus BMW Lexus
BMW Lexus Toyota Toyota Lexus Nissan Mercedes-Benz Mercedes-Benz Ferrari Dodge BMW Mercedes-Benz Aston Martin Mitsubishi Suzuki Maserati Porsche Maserati Land Rover Chevrolet Land Rover GMC Toyota Porsche Lexus Land Rover GMC Mercedes-Benz Toyota Lexus Toyota Lexus Toyota Nissan
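As for the fixed `Thread.sleep(3000)` in the example code, one possibly cleaner option is to poll for a condition with a timeout, so the wait ends as soon as the new content is there. The sketch below is a generic, library-independent version. `Poll.until` is a hypothetical helper, and the condition to check (for instance, that the label text differs from the first page) is an assumption about what signals a finished load. HtmlUnit's `WebClient` also offers `waitForBackgroundJavaScript(long)`, which may be enough on its own.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only.
final class Poll {

    // Checks the condition every intervalMillis until it holds or until
    // timeoutMillis has elapsed. Returns true if the condition was met.
    static boolean until(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 200 ms.
        boolean ready = until(() -> System.currentTimeMillis() - start >= 200, 3000, 50);
        System.out.println("ready = " + ready);
    }
}
```

In the example code, the condition could re-parse the current page and compare the `[id$=Label10]` text with the text printed for the first page, with 3000 ms kept as the upper bound.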

