HtmlUnit getting string with wrong encoding

I'm using HtmlUnit to execute some JavaScript in an HTML file. The JavaScript can be anything, for example document.querySelector().

When I run document.querySelector() through executeJavaScript() to obtain string data from the HTML, the encoding of the returned string gets garbled.

For example: Interés becomes InterÃ©s.

Is there a clever way to fix this by configuring the HtmlUnit objects?

Some code:

webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setAppletEnabled(false);
webClient.getOptions().setDownloadImages(false);

htmlPage = webClient.getPage(htmlFile.toURI().toURL()); // builds a well-formed file: URL on any platform

ScriptResult scriptResult = htmlPage.executeJavaScript(someJavascriptFunction);

// The value returned by scriptResult.getJavaScriptResult() already has encoding issues

I have tried setting webClient.addRequestHeader("Accept-Encoding", "utf-8"); but it doesn't help.

The problem here is the file source. There is no information about the encoding used when reading a plain file from disk. HtmlUnit handles this case the same way as when a web server does not provide any encoding information as part of the response. In these cases HtmlUnit (like real browsers) reads the file bytes using the StandardCharsets.ISO_8859_1 encoding. If the file is actually UTF-8 encoded, each multi-byte character such as é is therefore decoded as two separate characters.
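
The effect can be reproduced without HtmlUnit at all. The following is only a minimal sketch using the standard library; the class and method names (MojibakeDemo, garble, repair) are made up for illustration and are not part of HtmlUnit. It decodes UTF-8 bytes as ISO-8859-1, which gives the garbled form of Interés, and then reverses the mistake.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HtmlUnit.
public class MojibakeDemo {

    // Decode UTF-8 bytes as ISO-8859-1, which is what happens when a
    // UTF-8 file is read without any charset information.
    public static String garble(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake: recover the original bytes, then decode them as UTF-8.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Inter\u00e9s";              // "Interes" with an e-acute
        String garbled = garble(original);
        System.out.println(garbled.length());          // 8: the e-acute became two chars
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

Applying something like repair() to the string taken from the script result would be a workaround rather than a fix, and it only round-trips cleanly when the text really was UTF-8 read as ISO-8859-1.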

As a simple solution, write your file ISO_8859_1 encoded.
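
If the file already exists as UTF-8, one way to follow this advice is to re-encode it before loading it. This is a rough sketch under that assumption; Latin1Converter and convertUtf8ToLatin1 are hypothetical names, and the temporary files only stand in for your real HTML file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of HtmlUnit.
public class Latin1Converter {

    // Read the source file as UTF-8 (assumed) and write it back as ISO-8859-1,
    // so that every character is stored as a single byte.
    public static void convertUtf8ToLatin1(Path source, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Files.write(target, text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files used only for this illustration.
        Path utf8File = Files.createTempFile("page-utf8", ".html");
        Path latin1File = Files.createTempFile("page-latin1", ".html");
        String html = "<html><body><p>Inter\u00e9s</p></body></html>";
        Files.write(utf8File, html.getBytes(StandardCharsets.UTF_8));

        convertUtf8ToLatin1(utf8File, latin1File);
        // latin1File is the file to hand to webClient.getPage(...) as in the question.
        System.out.println(Files.size(utf8File) - Files.size(latin1File)); // 1
    }
}
```

Note that characters with no ISO-8859-1 representation are replaced with ? by String.getBytes, so this only suits content that fits in that charset.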
