
Error while using HtmlUnit

When I execute this simple code to get the contents of a website as text, it shows errors which I can't understand.

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.ScriptException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class sd {
    public static void main(String[] args) {
        sd vip=new sd();
        try {
            vip.homePage();
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.print("sssss");
    }

    public void homePage() throws Exception, ScriptException {
        final WebClient webClient = new WebClient();
        final HtmlPage page =
                (HtmlPage) webClient.getPage("http://timesofindia.indiatimes.com/");
        String pageAsText = page.asText();
        String pageAsXML = page.asXml();

        // System.out.println(pageAsXML);
        System.out.println("////////////////////output//////////////////////////"); 
        System.out.println(pageAsText);
        // System.out.println(pageAsXML);
        System.out.println("////////////////////output ends//////////////////////////"); 
    }

}

Error that I get:

======= EXCEPTION START ========
Exception class=[com.gargoylesoftware.htmlunit.ScriptException]
com.gargoylesoftware.htmlunit.ScriptException: Exception invoking jsxFunction_write
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:595)
Caused by: java.lang.RuntimeException: Exception invoking jsxFunction_write
Caused by: com.gargoylesoftware.htmlunit.ScriptException: Exception invoking jsxFunction_write
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:595)

The WebClient.setThrowExceptionOnScriptError method has been deprecated since HtmlUnit 2.11. In newer versions, use the following instead:

webClient.getOptions().setThrowExceptionOnScriptError(false);

Set your WebClient so that it does not throw exceptions on JavaScript errors:

webClient.setThrowExceptionOnScriptError(false);

If that is not enough, make the WebClient emulate Firefox when you create it:

webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
// or, depending on the HtmlUnit version:
webClient = new WebClient(BrowserVersion.FIREFOX_10);

I ran into this error too. Setting the WebClient to suppress script errors works for basic websites, but it breaks down as the website becomes more complex.

After multiple attempts, I ended up switching to PhantomJS, which is written in C++. I wrote a script and executed it with PhantomJS; the script loads the URL and writes the page data to a file.

Once that file is ready, a Java program loads the file and works on its contents. For loading and scraping the data, I used Jsoup.
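
The workflow above could be driven from Java roughly as follows. This is only a sketch under assumptions, not the exact code I used: the script name save_page.js, the output file page.html and the class name PhantomFetch are made up for illustration, and the script is assumed to take the URL and the output path as its two arguments and write the rendered HTML to that path. It uses only the JDK (Java 11 or newer) and assumes phantomjs is on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class PhantomFetch {

    /** Builds the command line: phantomjs <script> <url> <outputFile>. */
    static List<String> buildCommand(String phantomBinary, Path script, String url, Path output) {
        return List.of(phantomBinary, script.toString(), url, output.toString());
    }

    /** Runs the command, waits up to timeoutSeconds, then returns the contents of the output file. */
    static String fetch(List<String> command, Path output, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the child's output so it cannot block on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("Timed out: " + command);
        }
        if (process.exitValue() != 0) {
            throw new IOException("Exit code " + process.exitValue() + ": " + command);
        }
        return Files.readString(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path script = Path.of("save_page.js"); // hypothetical PhantomJS script
        Path output = Path.of("page.html");    // hypothetical output file
        List<String> command = buildCommand("phantomjs", script,
                "http://timesofindia.indiatimes.com/", output);
        String html = fetch(command, output, 60);
        System.out.println(html.length() + " characters saved");
        // The HTML string can now be handed to a parser such as Jsoup.
    }
}
```

The timeout is there because a headless browser can hang on a heavy page; 60 seconds is an arbitrary value to adjust.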

HtmlUnit, Jaunt and Jsoup handle HTML and CSS well. What they lack is complete JavaScript support, and that is the main cause of problems such as thrown exceptions and pages that do not load completely.
