HTMLUnit in Java - How to navigate to GridView pages

I'm trying to create a Java application that reads information from a web page. To extract the elements I want I used jsoup (excellent tool!), but I also want to load the next page of the GridView used on the page. The page is an .aspx page, and the link to the 2nd page looks like this:

 <a href="javascript:__doPostBack('GridView1','Page$2')" style="color:White;">2</a>

Below is the JavaScript function it calls:

    //<![CDATA[
    var theForm = document.forms['form1'];
    if (!theForm) {
        theForm = document.form1;
    }
    function __doPostBack(eventTarget, eventArgument) {
        if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
            theForm.__EVENTTARGET.value = eventTarget;
            theForm.__EVENTARGUMENT.value = eventArgument;
            theForm.submit();
        }
    }
    //]]>
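
For reference, the function above only fills two hidden inputs and submits `form1`, so the resulting request is an ordinary form POST. The sketch below builds the URL-encoded body such a submit would carry, using only the JDK. It assumes the form also contains the usual ASP.NET WebForms hidden fields (for example `__VIEWSTATE`), whose values would have to be read from the first page; the class name `PostBackForm` and the sample values are hypothetical, and the real page may require more fields than shown.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of HTMLUnit or jsoup.
final class PostBackForm {

    /**
     * Builds an application/x-www-form-urlencoded body from the form's hidden
     * fields plus the two values that __doPostBack sets before submitting.
     */
    static String encode(Map<String, String> hiddenFields,
                         String eventTarget, String eventArgument) {
        // Copy so the caller's map is not modified; keep field order stable.
        Map<String, String> fields = new LinkedHashMap<>(hiddenFields);
        // Mirrors theForm.__EVENTTARGET.value / theForm.__EVENTARGUMENT.value.
        fields.put("__EVENTTARGET", eventTarget);
        fields.put("__EVENTARGUMENT", eventArgument);

        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

Calling `PostBackForm.encode(hidden, "GridView1", "Page$2")` with the hidden fields scraped from page 1 gives the body for the page 2 request. This is also why the URL stays the same between pages: the page number travels in the POST body, not in the address.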

Currently I am trying to do this with HTMLUnit, but it does not seem to be working. Below is the code I am using:

final WebClient webClient = new WebClient(BrowserVersion.CHROME);
HtmlPage page = webClient.getPage("http://www.webpage.com/Main.aspx");
HtmlAnchor anchor = null;
List<HtmlAnchor> anchors = page.getAnchors();
for (int j = 0; j < anchors.size(); j++)
{
    anchor = anchors.get(j);
    String sAnchor = anchor.asText();
    String sAnchorxml = anchor.asXml();
    if (sAnchor.equals("2"))
    {
        HtmlPage page2 = anchor.click();
        doc = Jsoup.parse(page2.asXml());
        .....

When I read that page with the same code I use for the 1st page, I get the following error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at test.advacus.com.MainProgram.main(MainProgram.java:148)

I assume my error is in the 'Jsoup.parse()' line. To clarify: when you click through to the next page the URL does not change, only the contents of the GridView, so I cannot simply parse a new URL.

Any additional help, or a suggestion for a tool that cooperates with jsoup better than HTMLUnit, would be much appreciated. Thank you in advance!

Edited for additional info: It looks like click() is what is not working. I modified the code, and the newPage body appears to contain the same info as the 1st page:

final WebClient webClient = new WebClient(BrowserVersion.CHROME);       
HtmlPage page = webClient.getPage("http://www.qatarsale.com/EnMain.aspx");                  
HtmlAnchor anchor = page.getAnchorByText("2");              
HtmlPage newPage = anchor.click();      
HtmlElement el = newPage.getBody();
System.out.println(el.asText());

Inspecting the anchors shows, as you already pointed out, that __doPostBack is called, so it is much simpler to invoke that JavaScript call directly instead of first grabbing the anchor and calling click() on it.
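
If the target and page number should not be hard-coded, the arguments can be read out of an anchor's `href` and the script string rebuilt for any page. The following is a minimal sketch using only `java.util.regex`; the class name `PostBackHref` is hypothetical, and it assumes the links always have the exact single-quoted form shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLUnit or jsoup.
final class PostBackHref {

    // Matches javascript:__doPostBack('target','argument') with single quotes.
    private static final Pattern CALL = Pattern.compile(
            "javascript:__doPostBack\\('([^']*)','([^']*)'\\)");

    /** Returns {eventTarget, eventArgument}, or null if href is not a postback link. */
    static String[] parse(String href) {
        Matcher m = CALL.matcher(href.trim());
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    /** Builds the script to run for a given GridView page number. */
    static String scriptForPage(String eventTarget, int pageNumber) {
        return "__doPostBack('" + eventTarget + "','Page$" + pageNumber + "')";
    }
}
```

For the link in the question, `parse` yields `GridView1` and `Page$2`, and `scriptForPage("GridView1", 3)` produces the string that could be passed to `executeJavaScript` to request page 3.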

Example code

java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
final WebClient webClient = new WebClient(BrowserVersion.CHROME);

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(10000);

try {
    HtmlPage htmlPage = webClient.getPage("http://qatarsale.com/EnMain.aspx");

    Document doc = Jsoup.parse(htmlPage.asXml());

    System.out.println(doc.select("[id$=Label10]").text());

    ScriptResult result = htmlPage.executeJavaScript("__doPostBack('GridView1','Page$2')");
    htmlPage = (HtmlPage)result.getNewPage();

    Thread.sleep(3000); // delay needed for lazy loading, there might be something cleaner

    doc = Jsoup.parse(htmlPage.asXml());

    System.out.println(doc.select("[id$=Label10]").text());

} catch (Exception e) {
    e.printStackTrace();
} finally {
    webClient.close();
}
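
The fixed `Thread.sleep(3000)` above can be swapped for a wait that stops as soon as the new content is there. One option is a small polling helper, sketched here with the JDK only; the name `Poll.waitUntil` and its parameters are hypothetical, and the condition to poll (for example, "the Label10 text differs from page 1") is an assumption about what signals that the grid has reloaded.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HTMLUnit or jsoup.
final class Poll {

    /**
     * Checks the condition every pollMillis until it holds or timeoutMillis
     * have elapsed. Returns true if the condition became true in time.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition holding
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

In the example this might look like `Poll.waitUntil(() -> !firstPageText.equals(Jsoup.parse(page.asXml()).select("[id$=Label10]").text()), 10000, 250)`, where `firstPageText` and `page` are placeholders for the values already held by the calling code. HtmlUnit's own `webClient.waitForBackgroundJavaScript(3000)` may also be enough here, though whether it covers this page's lazy loading has not been checked.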

Output

Toyota Porsche Mercedes-Benz Cadillac Jeep Porsche Porsche Nissan Mitsubishi BMW Porsche Ford Mitsubishi Toyota Nissan Land Rover Nissan Mercedes-Benz Nissan Nissan Toyota Toyota Porsche Mitsubishi Mitsubishi Nissan Nissan Mercedes-Benz Nissan Jeep Mercedes-Benz Lexus BMW Lexus
BMW Lexus Toyota Toyota Lexus Nissan Mercedes-Benz Mercedes-Benz Ferrari Dodge BMW Mercedes-Benz Aston Martin Mitsubishi Suzuki Maserati Porsche Maserati Land Rover Chevrolet Land Rover GMC Toyota Porsche Lexus Land Rover GMC Mercedes-Benz Toyota Lexus Toyota Lexus Toyota Nissan
