
HTML characters from a downloaded page don't appear correctly

Some pages have HTML special characters in their content, but they are appearing as a square (an unknown character).

What can I do?

Can I convert the String containing the characters to another format (UTF-8)? The problem happens during the conversion from InputStream to String. I really don't know what causes it.

public HttpURLConnection openConnection(String url) {
    try {
        URL urlDownload = new URL(url);
        HttpURLConnection con = (HttpURLConnection) urlDownload.openConnection();
        con.setInstanceFollowRedirects(true);
        con.connect();
        return con;
    } catch (Exception e) {
        return null;
    }
}

private String getContent(HttpURLConnection con) {
    try {
        return IOUtils.toString(con.getInputStream());
    } catch (Exception e) {
        System.out.println("Error downloading page: " + e);
        return null;
    }
}

page.setContent(getContent(openConnection(con)));

You need to read the InputStream using an InputStreamReader with the charset specified in the Content-Type header of the downloaded HTML page. Otherwise the platform default charset will be used, which in your case apparently differs from the charset the page is encoded in.

Reader reader = new InputStreamReader(input, "UTF-8");
// ...
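
To make the effect of the charset argument concrete, here is a minimal sketch that decodes the same bytes twice. The class name DecodeDemo and the helper readAll are made up for illustration, and a ByteArrayInputStream stands in for the connection's stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Hypothetical helper: reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream input, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(input, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 is 5 bytes, because é takes two bytes.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: é decoded as one character
        System.out.println(wrong.length()); // 5: the two bytes of é decoded as two characters
    }
}
```

Decoding with the wrong charset does not throw an exception. It silently produces different characters, which is why the problem only shows up when the text is displayed.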

You can of course also use an HTML reader/parser like Jsoup, which takes this into account automatically.

String html = Jsoup.connect("http://stackoverflow.com").get().html();

Update: as per your updated question, you seem to be using URLConnection to request the HTML page and IOUtils to convert the InputStream to a String. You need to use it as follows:

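String contentType = connection.getHeaderField("Content-Type");
String charset = "UTF-8"; // Default to UTF-8 when the header names no charset
if (contentType != null) { // The header may be absent
    for (String param : contentType.replace(" ", "").split(";")) {
        if (param.startsWith("charset=")) {
            charset = param.split("=", 2)[1].replace("\"", ""); // Strip optional quotes
            break;
        }
    }
}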

String html = IOUtils.toString(input, charset);
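
The header parsing above can also be pulled into a small helper that is easier to test in isolation. This is only a sketch: the class ContentTypeCharset and the method charsetOf are hypothetical names, and falling back to UTF-8 for a missing or unknown charset is an assumption carried over from the snippet above, not a rule.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    // Returns the charset named in a Content-Type header value, or UTF-8 when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return StandardCharsets.UTF_8; // malformed or unsupported name
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

Assuming a Commons IO version that has the IOUtils.toString(InputStream, Charset) overload, the result can be passed straight in: IOUtils.toString(input, ContentTypeCharset.charsetOf(connection.getContentType())).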

If you're still having problems getting the characters right, then it can only mean that the console/viewer you're printing those characters to doesn't support the charset. For example, when you run the following in Eclipse

System.out.println(html);

Then you need to ensure that the Eclipse console uses UTF-8. You can set it via Window > Preferences > General > Workspace > Text File Encoding.

Or if you're writing it to a file with FileWriter, then you should rather use InputStream/OutputStream from the start, without converting to String first. If converting to String is really a necessary step, then you need to write it to new OutputStreamWriter(output, "UTF-8").
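
Both options can be sketched as follows. The class SaveDemo and its two methods are illustrative names only, and in-memory streams stand in for the connection and the file so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SaveDemo {
    // Option 1: copy the raw bytes. Nothing is decoded, so no charset is involved.
    static void copyBytes(InputStream input, OutputStream output) {
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = input.read(buffer)) != -1) {
                output.write(buffer, 0, n);
            }
            output.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Option 2: the text is already a String, so encode it explicitly as UTF-8.
    static void writeUtf8(String text, OutputStream output) {
        try {
            Writer writer = new OutputStreamWriter(output, StandardCharsets.UTF_8);
            writer.write(text);
            writer.flush(); // flush instead of close, so the caller keeps ownership of the stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] page = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        copyBytes(new ByteArrayInputStream(page), raw);

        ByteArrayOutputStream encoded = new ByteArrayOutputStream();
        writeUtf8("caf\u00e9", encoded);

        // Both paths end up with the same UTF-8 bytes.
        System.out.println(Arrays.equals(raw.toByteArray(), encoded.toByteArray())); // true
    }
}
```

With a real file, a FileOutputStream would take the place of the ByteArrayOutputStream in either method.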
